MINING VISUALNESS
Zheng Xu1∗, Xin-Jing Wang2, Chang Wen Chen3
1University of Science and Technology of China, Hefei, 230027, P.R. China
2Microsoft Research Asia, Beijing, 100080, P.R. China
3State University of New York at Buffalo, Buffalo, NY, 14260, U.S.A.
[email protected], [email protected], [email protected]
ABSTRACT
Understanding which concepts are visualizable, and to what
extent they can be visualized, is worthwhile for multimedia
and computer vision research. Unfortunately, few previous
works have touched such topics. In this paper,
we propose a unified model to automatically identify visual
concepts and estimate their visual characteristics, or visual-
ness, from a large-scale image dataset. To this end, an image
heterogeneous graph is first built to integrate various visual
features, and then a simultaneous ranking and clustering al-
gorithm is introduced to generate visually and semantically
compact image clusters, named visualsets. Based on the vi-
sualsets, visualizable concepts are discovered and their visu-
alness scores are estimated. The experimental results demon-
strate the effectiveness of the proposed scheme.
Index Terms— visualness, visualsets, image heteroge-
neous graph, ranking, clustering
1. INTRODUCTION
Despite decades of successful research on multimedia and
computer vision, the semantic gap between low-level visu-
al features and high-level semantic concepts remains a prob-
lem. Instead of generating more powerful features or learning
more intelligent models, researchers have started to investi-
gate which concepts can be more easily modeled by existing
visual features [1, 2, 3].
Understanding to what extent a concept has visual
characteristics, i.e. its "visualness", is valuable in many ways. For instance,
it can benefit recent research efforts on constructing image
databases [4, 5]. These efforts generally attempt to attach
images to pre-defined lexical ontologies, while existing
ontologies were built without taking visual characteristics into
consideration. Knowing which concepts are more likely to have
relevant images will help save labor and control noise in
database construction. Visualness estimation is also useful
for image-to-text [2, 6] and text-to-image [7] translation, e.g.,
words of more visualizable concepts are potentially better an-
notations for an image.
∗This work was performed at Microsoft Research Asia.
Fig. 1. Examples of simple concepts (left) and compound
concepts (right) with estimated visualness, e.g., "Eagle" 1.90
and "Pink" 0.75 (simple); "Cordless phone" 5.01, "Egypt
Sphinx" 4.06, and "Offshore oilfield" 4.00 (compound).
Despite its usefulness, a general solution to visualness
estimation faces many challenges: 1) It is unknown which
concepts, or which types of concepts, are visualizable, i.e.
whether representative images can be found to visualize their
semantics. For instance, "dignity" and "fragrant" are both
abstract nouns, but the former is more difficult to visualize, as
"fragrant" is closely related to visual concepts such as flowers and fruits;
2) Different visual concepts have diverse visual compactness
and consistency, especially for collective nouns (e.g., “ani-
mal”) and ambiguous concepts (e.g., “apple”, which may rep-
resent a kind of fruit or a company); and 3) Even though a
concept is highly visualizable, it may still be difficult to cap-
ture the visual characteristics due to the semantic gap, e.g.,
“tiny teacup chihuahua”.
Few previous works in the literature have touched this
research topic. To the best of our knowledge, Yanai et al. [1]
were the first to propose the visualness of a concept; they
estimate the visualness of adjectives with image region entropy.
Zhu et al. [7] measure the "picturability" of a keyword by the
ratio of the number of images to the number of webpages
retrieved by commercial search engines. Lu et al. [2] attempt to
identify a concept lexicon with small semantic gaps by
clustering images based on visual and textual features, and
then extracting the most frequent keywords occurring in the
image clusters. Berg et al. [8] utilize SVM classifiers to
measure the visualness of attributes.
Fig. 2. The proposed framework of visualness mining: a) build an image heterogeneous graph from web images with multi-type
features; b) mine visualsets from the image heterogeneous graph by iteratively performing i) ranking nodes, ii) posterior
estimation, and iii) cluster refinement; c) estimate visualness with the visualsets.
Sun et al. [9] investigate whether a tag well represents its
annotated images in the social tagging scenario, by leveraging
the visual compactness of the annotated images and their
visual distinctness from the entire image database. Jeong et
al. [3] quantify the visualness of complex concepts of the form
"adjective (attribute) + noun (concept)" by integrating
intra-cluster purity, inter-cluster purity, and the entropy of
images retrieved from commercial search engines.
There are two major disadvantages of previous works: 1)
the concept lists are pre-defined, and 2) the concepts are
simply assumed to be prototypical. Two notable works are
reported in [2] and [3]. Lu et al. [2] mine concepts with small
semantic gaps automatically from an image dataset, but assume
each concept is prototypical. In contrast, Jeong et al. [3] focus
on "complex concepts" that imply reliable prototypical
concepts, but their concept vocabulary is pre-defined. In fact,
many concepts are semantically ambiguous (e.g., "apple" as a
fruit or a company) or have multiple senses (e.g., "snake" as a
noun meaning the reptile or a verb suggesting a snake-like
pattern).
In this paper, we attempt to discover and quantify the
visualness of concepts automatically from a large-scale dataset.
The quantitative measure of a concept is based on visual and
semantic synsets (which we call "visualsets"), rather than a
single image cluster or keyword as in previous works. Visualsets
disambiguate the semantics of a concept and ensure visual
compactness and consistency, inspired by the synsets of
ImageNet [4] and Visual Synsets [6].
In our approach, a visualset is a group of visually similar
images (which we call "member images") and related words,
both scored by their membership probabilities. Visualsets
contain prototypical visual cues as well as prototypical semantic
concepts. Given the visualsets, the visualness of a concept is
modeled as a mixture distribution over its corresponding
visualsets. Moreover, we discover both simple concepts
(keywords) and compound concepts (combinations of unique
keywords) simultaneously from the generated visualsets; see
Fig. 1.
2. VISUALNESS MINING
In this section, we describe our approach of quantifying the
visualness of concepts.
2.1. The Framework
Our approach contains three steps: 1) build an image
heterogeneous graph with attribute nodes generated from
multi-type features; 2) mine visualsets from the heterogeneous
graph with an iterative ranking-clustering algorithm; and 3)
estimate the visualness of concepts with the visualsets. Fig. 2
illustrates the entire process.
Given a (noisily) tagged image dataset such as a web im-
age collection, we connect the images into a graph to facili-
tate the clustering approach for visualsets mining, as shown
in Fig. 2(a). Specifically, we extract multiple types of visual
features and textual features from the images to generate
attribute nodes (see Fig. 2(a1)). The edges of the graph are defined
by links between images and attribute nodes (see Fig. 2(a2))
instead of image similarities which are generally adopted in
previous works [2]. That is, images are implicitly connected
if they share some attribute nodes.
Then, an iterative ranking-clustering approach is applied
to form visual and textual synsets, i.e. visualsets, as shown in
Fig. 2(b). In each iteration, we start with a guess of the image
clusters (see Fig. 2(b1)). Based on this guess, we score and
rank each image as well as the attribute nodes in each visualset
(Fig. 2(b2)). Images are then mapped to the feature space
defined by the visualset mixture model (Fig. 2(b3)). Clusters
are refined based on the estimated posteriors, which gives the
guess of image clusters for the next iteration.
After the ranking-clustering approach converges, we
estimate the visualness of concepts (simple and compound) from
the visualsets based on the final scores of images and attribute
nodes, as shown in Fig. 2(c).
2.2. Building An Image Heterogeneous Graph
Given images $\mathcal{X} = \{x_1, x_2, \ldots, x_{|\mathcal{X}|}\}$, the visual features GIST,
color histogram, and SIFT are extracted for each image.
(a) Color histogram clusters (b) GIST clusters
Fig. 3. Examples of image clusters generated by different vi-
sual features. They suggest how different visual features favor
distinct visual properties. Color histogram captures intensity
and contrast, while GIST prefers structure and edge.
Fig. 4. A heterogeneous sub-graph centered on one image.
Each image has only one set-based attribute node for each
type of global feature (h, g), and multiple word-based
attribute nodes from local or textual features (w, t).
We exploit multiple types of visual features as they capture
different perspectives of visual properties that complement each
other (Fig. 3). Our proposed visualness mining method is
independent of the visual features selected.
We run k-means clustering on each of the three types of visual
features, and treat the cluster centers as attribute nodes. We
represent each image with one attribute node from GIST and
one from the color histogram (the so-called set-based attribute
nodes), and with bags of words over SIFT and texts (the
so-called word-based attribute nodes). Fig. 4 illustrates
the star-structure centered on an image.
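For concreteness, below is a minimal sketch (not the authors' code) of how the attribute nodes could be generated with scikit-learn's k-means; the function name, array shapes, and the choice of MiniBatchKMeans are our own illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_attribute_nodes(gist, color_hist, sift_descriptors, n_clusters=2000):
    """Cluster each feature type; the cluster centers act as attribute nodes.

    gist, color_hist: (n_images, d) global features -> one set-based node per image.
    sift_descriptors: list of (n_i, 128) arrays -> a bag of visual words per image.
    """
    # Set-based attribute nodes: each image links to exactly one center per type.
    g_labels = MiniBatchKMeans(n_clusters=n_clusters, n_init=3).fit(gist).labels_
    h_labels = MiniBatchKMeans(n_clusters=n_clusters, n_init=3).fit(color_hist).labels_

    # Word-based attribute nodes: quantize local SIFT descriptors into a
    # visual vocabulary, so each image gets a multiset of visual-word ids.
    vocab = MiniBatchKMeans(n_clusters=n_clusters, n_init=3).fit(np.vstack(sift_descriptors))
    visual_word_bags = [vocab.predict(d) for d in sift_descriptors]

    return g_labels, h_labels, visual_word_bags
```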
Denote the attribute nodes from GIST, color histogram, SIFT,
and texts as $\mathcal{G} = \{g_1, g_2, \ldots, g_{|\mathcal{G}|}\}$, $\mathcal{H} = \{h_1, h_2, \ldots, h_{|\mathcal{H}|}\}$,
$\mathcal{W} = \{w_1, w_2, \ldots, w_{|\mathcal{W}|}\}$, and $\mathcal{T} = \{t_1, t_2, \ldots, t_{|\mathcal{T}|}\}$
respectively. We have the set of attribute nodes
$\mathcal{A} = \mathcal{G} \cup \mathcal{H} \cup \mathcal{W} \cup \mathcal{T}$.
We define the heterogeneous image graph $G = \langle \mathcal{V}, \Omega \rangle$ as
a weighted undirected graph, where $\mathcal{V} = \mathcal{X} \cup \mathcal{A}$ are the nodes
and $\Omega = \{\omega_{x_i n_j} \mid x_i \in \mathcal{X}, n_j \in \mathcal{V}\}$ defines the edges. We have

$$\omega_{x_i n_j} = \begin{cases} 1, & n_j \in \mathcal{G} \cup \mathcal{H},\ x_i \sim n_j \\ c, & n_j \in \mathcal{W} \cup \mathcal{T},\ \mathrm{freq}(x_i, n_j) = c \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

where $x_i \sim n_j$ denotes a link between $x_i$ and $n_j$, and
$\mathrm{freq}(x_i, n_j)$ represents the frequency with which (visual) word $n_j$
occurs in image $x_i$.
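The weights of Eq. 1 can then be assembled into a single sparse image-by-attribute matrix with columns ordered [G | H | W | T]. The sketch below assumes the outputs of the previous snippet plus per-image tag lists; all identifiers are illustrative, not the authors' implementation.

```python
from collections import Counter
import scipy.sparse as sp

def build_edge_weights(g_labels, h_labels, visual_word_bags, tag_lists,
                       tag_vocab, n_vis=2000):
    """Assemble the image-to-attribute-node weights of Eq. (1)."""
    n_img = len(g_labels)
    rows, cols, vals = [], [], []
    for i in range(n_img):
        # Set-based nodes (G, H): a single link of weight 1 per feature type.
        rows += [i, i]
        cols += [g_labels[i], n_vis + h_labels[i]]
        vals += [1, 1]
        # Word-based nodes (W, T): weight c = occurrence frequency.
        for w, c in Counter(visual_word_bags[i]).items():
            rows.append(i); cols.append(2 * n_vis + w); vals.append(c)
        for t, c in Counter(tag_lists[i]).items():
            rows.append(i); cols.append(3 * n_vis + tag_vocab[t]); vals.append(c)
    n_cols = 3 * n_vis + len(tag_vocab)
    return sp.csr_matrix((vals, (rows, cols)), shape=(n_img, n_cols))
```

Images that share a column in this matrix are implicitly connected, which is exactly the linkage the clustering step exploits.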
2.3. Mining Visualsets
Algorithm 1 The Visualsets Mining Algorithm
Input: $G = \langle \mathcal{V}, \Omega \rangle$.  Output: $\mathcal{C}$.
Initialization: obtain $\{C_k^{(0)}\}_{k=1}^{K}$ by randomly partitioning the images.
Iteration:
for each $t \in [1, N]$ do
    Ranking:
        compute $P(S|C_k^{(t)})$ and $P(A|C_k^{(t)})$ by Eq. 2 and Eq. 3
        compute $P(\mathcal{X}|C_k^{(t)})$ by Eq. 4
    Clustering:
        estimate $p(C_k^{(t)})$ and $p(C_k^{(t)}|x_i)$ by Eq. 7
        refine $C_k^{(t)}$ as in Section 2.3.3
end for
return $\mathcal{C} = \{C_k^{(N)} \mid k = 1, \ldots, K\}$
Denote the set of $K$ visualsets as $\mathcal{C} = \{C_k \mid k = 1, \ldots, K\}$.
Topologically, $C_k \subset G$ is a subgraph defined on the member
images of $C_k$ and their related attribute nodes. Each node of
$C_k$ is associated with a score indicating its importance to $C_k$.
Images in $C_k$ are assumed to be consistent in visual appearance,
while the texts in $C_k$ form a synset with disambiguated
semantics; together they constitute a visually and semantically
compact and disambiguated visualset. $\mathcal{C}$ defines a generative
mixture model for images on which visualness can be measured.
We adopt the ranking-clustering framework of [10] to
generate $\mathcal{C}$. The framework provides a method that can
simultaneously partition a heterogeneous graph and weight the
nodes in each cluster, and hence is suitable for visualset mining.
Meanwhile, addressing the authority of nodes by ranking during
image clustering helps handle noisy web data. We choose a
representative image instead of the mean as the center in cluster
refinement to further facilitate the process.
Algorithm 1 outlines our method. We initialize the algorithm
with a random partition of $\mathcal{X}$ and assign a uniform score to
each image $x_i \in C_k^{(0)}$. We propagate the scores via the edges
$\Omega$ of the graph $G$ to rank the attribute nodes $n_j \in \mathcal{A}$, which in
turn updates the scores of the $x_i$ in each visualset $C_k$. This is
the ranking step. Images are then mapped to the measurement
space defined by the mixture model of visualsets, and the
partition of $\mathcal{X}$ is refined according to the posteriors. This is the
clustering step. The process iterates until convergence, which
yields the final visualsets $\mathcal{C}$. We detail the iteration process
below.
2.3.1. Ranking nodes
Denote $\mathcal{X}_{C_k}$ as the member images of $C_k$ and $P(\mathcal{X}_{C_k})$ as the
authority of the member images in $C_k$. Denote set-based
attribute nodes as $S \in \{\mathcal{G}, \mathcal{H}\}$ and word-based attribute nodes
as $A \in \{\mathcal{W}, \mathcal{T}\}$. Let $P(S|C_k)$ and $P(A|C_k)$ be the authority of
the elements of $S$ and $A$ in $C_k$. Let $W_{S C_k}$ and $W_{A C_k}$ be the
weight matrices of the edges between the attribute nodes $S, A$
and the member images $\mathcal{X}_{C_k}$ respectively. We have
$W_{S C_k} = W_{C_k S}^{\mathsf{T}}$ and $W_{A C_k} = W_{C_k A}^{\mathsf{T}}$.
Assuming that the different types of attribute nodes are
independent, we first compute the authority scores of each type
of attribute node separately. We adopt a HITS-like [11]
iterative algorithm on the bipartite graph between each type of
word-based attribute nodes $A$ and $\mathcal{X}_{C_k}$ to get $P(A|C_k)$ as in
Eq. 2. We introduce the smoothing term $\lambda_A$ in $P(A|C_k)$ to
tolerate errors in the clustering used to generate the attribute
nodes.

$$P(\mathcal{X}_{C_k}) = \frac{W_{C_k A} \cdot P(A|C_k)}{\|W_{C_k A} \cdot P(A|C_k)\|_1}, \qquad P(A|C_k) = (1 - \lambda_A)\,\frac{W_{A C_k} \cdot P(\mathcal{X}_{C_k})}{\|W_{A C_k} \cdot P(\mathcal{X}_{C_k})\|_1} + \lambda_A \frac{1}{|A|} \qquad (2)$$
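For intuition, a minimal sketch of the Eq. 2 iteration on one visualset's bipartite graph might look as follows; the dense block W_k and the fixed iteration count are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def rank_word_nodes(W_k, lam_A=0.1, n_iter=50):
    """HITS-like mutual reinforcement of Eq. (2) within one visualset C_k.

    W_k: (n_member_images, n_word_nodes) edge-weight block for C_k.
    Returns the image authorities P(X_Ck) and node authorities P(A|C_k).
    """
    n_attr = W_k.shape[1]
    p_attr = np.full(n_attr, 1.0 / n_attr)          # uniform start
    for _ in range(n_iter):                         # fixed-point iteration
        p_img = W_k @ p_attr
        p_img /= p_img.sum()                        # L1 normalization
        raw = W_k.T @ p_img
        p_attr = (1 - lam_A) * raw / raw.sum() + lam_A / n_attr  # smoothing
    return p_img, p_attr
```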
On the other hand, the authority $P(S|C_k)$ of each type of
set-based attribute nodes $S$ is estimated with the help of the
word-based attribute nodes $A$, bridged by $\mathcal{X}_{C_k}$. We involve $A$
in ranking $S$ because word-based attribute nodes are either
semantic keywords or visual words that indicate semantic and
visual similarity. The scoring function is given in Eq. 3:

$$P(A|C_k) = \frac{W_{A C_k} \cdot W_{C_k S} \cdot P(S|C_k)}{\|W_{A C_k} \cdot W_{C_k S} \cdot P(S|C_k)\|_1}, \qquad P(S|C_k) = (1 - \lambda_S)\,\frac{W_{S C_k} \cdot W_{C_k A} \cdot P(A|C_k)}{\|W_{S C_k} \cdot W_{C_k A} \cdot P(A|C_k)\|_1} + \lambda_S \frac{1}{|S|} \qquad (3)$$
We then estimate the authority of each image $x_i \in \mathcal{X}$ in $C_k$,
i.e. $p(x_i|C_k)$. Since the types of attribute nodes
$Q \in \{\mathcal{G}, \mathcal{H}, \mathcal{W}, \mathcal{T}\}$ are assumed independent, we have

$$p(x_i|C_k) = \frac{1}{Z} \prod_{Q} p(x_i|Q, C_k) \qquad (4)$$

where

$$p(x_i|Q, C_k) = \frac{\sum_{n_j \in Q} \omega_{x_i n_j}\, p(n_j|C_k)}{\sum_{n_j \in Q} \omega_{x_i n_j}} \qquad (5)$$

and $Z = \sum_{x_i \in \mathcal{X}} p(x_i|C_k)$ is the normalization factor.
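A direct transcription of Eqs. 4-5 could look like the sketch below, where each feature type contributes one multiplicative factor; the dictionary-based interface and the zero-degree guard are our own assumptions.

```python
import numpy as np

def image_authority(weights_by_type, node_scores_by_type):
    """Combine per-type node scores into p(x_i|C_k) via Eqs. (4)-(5).

    weights_by_type: dict feature type -> (n_images, n_nodes) weights omega.
    node_scores_by_type: dict feature type -> (n_nodes,) scores p(n_j|C_k).
    """
    n_img = next(iter(weights_by_type.values())).shape[0]
    p = np.ones(n_img)
    for q, W in weights_by_type.items():
        num = np.asarray(W @ node_scores_by_type[q]).ravel()   # weighted scores
        den = np.asarray(W.sum(axis=1)).ravel()                # total edge weight
        den[den == 0] = 1.0            # guard images with no links of type q
        p *= num / den                 # one factor of the Eq. (4) product
    return p / p.sum()                 # normalization Z
```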
2.3.2. Posterior estimation
We then estimate the posterior $p(C_k|x_i)$, $C_k \in \mathcal{C}$, $x_i \in \mathcal{X}$,
based on which the image clusters are refined. We use the EM
algorithm to maximize the likelihood (Eq. 6) of generating the
images from the mixture model built on the visualsets:

$$\log L = \sum_{i=1}^{|\mathcal{X}|} \log p(x_i) = \sum_{i=1}^{|\mathcal{X}|} \log \Big( \sum_{k=1}^{K} p(x_i|C_k)\, p(C_k) \Big) \qquad (6)$$

Introducing the $p(C_k)$ as hidden variables, $p(C_k|x_i)$ and $p(C_k)$
are estimated iteratively as

$$p(C_k|x_i) = \frac{p(x_i|C_k)\, p(C_k)}{\sum_{k'=1}^{K} p(x_i|C_{k'})\, p(C_{k'})}, \qquad p(C_k) = \frac{1}{|\mathcal{X}|} \sum_{i=1}^{|\mathcal{X}|} p(C_k|x_i) \qquad (7)$$
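These two updates are the standard EM alternation; a compact sketch, assuming the per-cluster likelihoods of Eq. 4 are precomputed into a matrix, is:

```python
import numpy as np

def em_posteriors(likelihood, n_iter=100):
    """Alternate the two updates of Eq. (7).

    likelihood: (n_images, K) matrix of p(x_i|C_k) from Eq. (4).
    Returns posteriors p(C_k|x_i) and priors p(C_k).
    """
    K = likelihood.shape[1]
    prior = np.full(K, 1.0 / K)                          # uniform initial priors
    for _ in range(n_iter):
        joint = likelihood * prior                       # p(x_i|C_k) p(C_k)
        post = joint / joint.sum(axis=1, keepdims=True)  # E-step: p(C_k|x_i)
        prior = post.mean(axis=0)                        # M-step: p(C_k)
    return post, prior
```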
2.3.3. Cluster refinement
After posterior estimation, each image $x_i$ is represented as a
$K$-dimensional vector $p(x_i) = (p(C_1|x_i), p(C_2|x_i), \ldots, p(C_K|x_i))$.
Cluster refinement reassigns each $x_i$ to the nearest visualset as
a member image. The distance of $x_i$ to visualset $C_k$ is measured
against the cluster center $x_{C_k}$ by

$$\mathrm{dist}(p(x_i), p(x_{C_k})) = 1 - \frac{p'(x_i) \cdot p(x_{C_k})}{\|p(x_i)\|_2 \cdot \|p(x_{C_k})\|_2} \qquad (8)$$

where $x_{C_k}$ is the most representative image of $C_k$, chosen
according to the posterior by Eq. 9:

$$x_{C_k} = \arg\max_{x \in \mathcal{X}_{C_k}} p(C_k|x) \qquad (9)$$

We then have $x_i \in C_k$ if $\forall k' \in \{1, 2, \ldots, K\}$, $k' \neq k$:
$\mathrm{dist}(p(x_i), p(x_{C_k})) < \mathrm{dist}(p(x_i), p(x_{C_{k'}}))$.
2.4. Visualness estimation
After the visualsets converge, we compute the final scores of
the textual words $\mathcal{T}$ by

$$P(\mathcal{T}|C_k) = \frac{W_{\mathcal{T} \mathcal{X}} \cdot P(\mathcal{X}|C_k)}{\|W_{\mathcal{T} \mathcal{X}} \cdot P(\mathcal{X}|C_k)\|_1} \qquad (10)$$

$P(\mathcal{T}|C_k)$ measures the importance of words in a semantically
and visually compact visualset.
We then discover concepts and estimate their visualness in
the following way. We form a concept as $\mathbf{t} = \{t_l\}_{l=1}^{L}$,
$t_l \in \mathcal{T}$; $L = 1$ indicates a simple concept, whereas $L > 1$
identifies a compound concept. The associated score of concept
$\mathbf{t}$ with visualset $C_k$ is $p(\mathbf{t}|C_k) = \min_{t_l \in \mathbf{t}} p(t_l|C_k)$. The
visualness $V(\mathbf{t})$ of concept $\mathbf{t}$ is measured on the highly
relevant visualsets whose $p(\mathbf{t}|C_k)$ is larger than a threshold
$\tau$. Define the indicator function

$$I(\mathbf{t}, C_k) = \begin{cases} 1, & p(\mathbf{t}|C_k) > \tau \\ 0, & \text{otherwise} \end{cases} \qquad (11)$$

The visualness score $V(\mathbf{t})$ of concept $\mathbf{t}$ is then a combination
of the associated scores over the relevant visualsets:

$$V(\mathbf{t}) = \frac{\sum_{k=1}^{K} p(\mathbf{t}|C_k) \cdot p(C_k) \cdot I(\mathbf{t}, C_k)}{\sum_{k=1}^{K} I(\mathbf{t}, C_k)} \qquad (12)$$
Fig. 5. Examples of visualsets. Top ranked images and terms are shown; the top ranked terms with their scores are:
Visualset 59: Alley 7.79, Street 2.78, Building 2.20, Light 2.15, Italy 1.49
Visualset 91: Canyon 17.59, Antelope 16.25, Arizona 12.21, Page 8.11, Slot 5.51
Visualset 406: Butterfly 17.43, Flower 6.78, Jesters 5.26, Nature 4.73, Mariposa 3.84
Visualset 530: Ceiling 12.77, Glass 12.63, British 10.59, Museum 10.45, London 10.45
Visualset 886: Red 5.40, Orange 2.20, Bravo 1.34, Light 1.04, Wall 0.95
Visualset 1120: Foals 22.65, Horses 17.41, Horse 5.97, Pony 3.81, Small 3.73
Visualset 1223: Giant 10.67, Panda 10.46, Zoo 10.44, California 10.04, Bear 9.59
Visualset 1338: Gymnastics 85.40, Senior 4.76, Sports 0.87, Gymnast 0.86, Floor 0.43
Visualset 1404: Herd 12.91, Horse 12.35, Gaucho 11.59, Criollo 11.41, Argentina 10.69
3. EXPERIMENT
3.1. Dataset
The NUS-WIDE dataset [12], containing 269,648 images and
5,018 unique tag words from Flickr, is used in our experiments.
Two types of global features, a 64-D color histogram and 512-D
GIST, are extracted. Each type of global feature is further
clustered into 2,000 clusters by k-means, whose centers form
the set-based attribute nodes of the image heterogeneous graph.
Local SIFT features are also extracted and clustered into 2,000
visual words by k-means, based on which the word-based
attribute nodes are generated.
3.2. Visualsets
First, we evaluate the performance of the visualset generation
approach presented in Section 2.3. Setting the number of
visualsets to K = 3000 and the smoothing parameters to
λS = 0.3 for set-based attribute nodes and λA = 0.1 for
word-based attribute nodes (the optimal parameters found via
experiments), visualset mining finishes in fewer than 30
iterations.
Fig. 5 shows some randomly selected visualsets. For each
visualset, the top ranked images and top ranked words are
displayed. Generally, the images in a visualset are visually
compact and consistent, and are semantically related to the top
ranked words. Meanwhile, the associated word scores illustrate
the effectiveness of visualsets. For example, the five words of
visualset 530 are all highly relevant to the member images and
have close scores. In contrast, there is a big jump between the
scores of the top two words in visualset 1338, as "Gymnastics"
is apparently much more relevant to the images than "Senior".
Moreover, visualsets 1120 and 1404 suggest that learning
synsets is valuable: both visualsets are about horses, but 1120
focuses on foals whereas 1404 is about a herd of horses.
Visualsets also reveal relationships between concepts. For
example, visualset 59 may mean that an "alley" is a narrow
"street" between "buildings". Visualset 91 turns out to suggest
the "Antelope Canyon" in "Arizona". Visualset 406, on the
other hand, implies frequent co-occurrences of "butterfly" and
"flower", while visualset 1223 has very close scores for "giant"
and "panda", which suggests the compound concept "giant
panda" for the panda images.
Fig. 6. Visualness of simple concepts.
Top ranked concepts with visualness: Saw 11.01; Tollbooth 9.18; Foal 8.61; Cordless 8.28; Cheerleader 8.19; Cellphone 8.06; Firefighter 7.57; Tennis 6.90; Donkey 6.83; Pavilion 6.40; Telephone 6.33; Beard 6.30; Gymnastics 6.03; Foxhole 5.88; Camel 5.77; ...
Bottom ranked concepts: Asian 3.10E-2; October 2.64E-2; Colorful 2.53E-2; Ladybug 4.89E-2; Available 1.01E-4.
Example visualsets (top words with scores): Saw 34.71, War 1.66, Army 1.03; Tollbooth 30.66, Glasgow 6.94, Clock 6.53; Tollbooth 40.15, Road 8.20, Indiana 1.99; Pavilions 17.56, Architecture 11.45, Italy 11.08; Pavilions 17.05, China 6.70, Temple 2.25; Ladybug 8.33, Pattern 8.33, Texture 8.33; Architecture 6.10, Building 5.81, Asian 3.83; Cheerleader 7.31, Girls 7.31, Asian 7.31.
3.3. Visualness
Now we evaluate the effectiveness of discovering simple and
compound concepts as well as estimating their visualness. We
set the threshold τ = 0.04 and then normalize all visualness
scores.

3.3.1. Simple concepts

We find that 2,424 simple concepts out of the 5,018 unique tag
words in NUS-WIDE are visualizable, i.e., their visualness
scores exceed τ = 0.04. Fig. 6 illustrates the top and bottom
simple concepts, their visualness scores, and the top images and
words from the corresponding visualsets. A concept may relate
to multiple visualsets; e.g., the two visualsets of "tollbooth"
show its semantics in daytime and at night respectively.
From Fig. 6, we can see that concepts with more compact
visualsets generally have higher visualness scores; e.g., "Asian"
has a very low visualness score because its relevant visualsets
(we show only two due to the space limit) are quite different in
their visual and textual features.
Fig. 7. Visualness of animals.
Concepts with visualness: Camel 5.77; Goat 4.02; Moose 4.02; Rabbit 3.58; Bear 3.35; Giraffe 3.31; Lion 3.12; Antelope 2.63; Animal 2.59.
Example visualsets (top words with scores): Moose 32.42, Alaska 25.65, Animal 7.79; Antelope 10.04, Zoo 5.06, Nature 4.75; Canyon 14.98, Antelope 14.67, Arizona 12.49; Giraffe 26.82, Zoo 12.21, Animal 6.58; Fox 14.70, Wildlife 6.95, Animal 6.11.
Fig. 8. Visualness of compound concepts.
Top ranked concepts with visualness: London graffiti 5.82; Cordless phone 5.01; Wedding ceremony 4.66; Wedding bride 4.24; Egypt sphinx 4.06; Offshore oilfield 4.00; Sky flags 3.86; Car pavilions 3.76; Sport tennis 3.68; Horses foals 3.61; Snow avalanche 3.57; ...
Bottom ranked concepts: Cute animal 0.98; Vacation outdoors 2.30E-2; Pretty things 1.01E-4; Surprise things 1.01E-4.
Example visualsets (top words with scores): Graffiti 46.33, London 16.90, Artillery 16.67; London 36.27, Graffiti 35.56, Art 9.37; Bride 11.42, Wedding 11.37, Portrait 3.47; Wedding 12.55, Groom 11.37, Bride 11.71; Avalanche 11.02, Snow 9.57, Washington 4.82; Harbors 16.37, Outdoors 11.85, Vacation 4.52.
Though "ladybug" is intuitively a highly visualizable concept, it
is scored low because there are few ladybug images in the
NUS-WIDE dataset.1 We also found the following concepts to
be non-visualizable: "positive", "planar", "reflected", "fund",
"sold", "vernacular", "grief". Dataset effects aside, abstract
concepts are generally difficult to visualize.

1Our method is able to take the popularity of concepts into consideration
due to the priors p(Ck) of the visualsets in Eq. 12, which is reasonable as, in
the web scenario, a concept is less visualizable if it has fewer relevant images.
Fig. 7 shows more results on animals. A specific concept
like "moose" is more visualizable than a general concept like
"animal". "Antelope" ranks in between, as it is an ambiguous
concept that can indicate either a landmark or an animal.
3.3.2. Compound concepts
We discovered 26,378 visualizable compound concepts from
NUS-WIDE. Fig. 8 shows a few examples. Each compound
concept is a combination of two unique tag words. The top
visualizable compound concepts are composed of closely
related terms, rather than being restricted to the
"adjective+noun" pattern studied in [3]; e.g., the landmark
"Sphinx" is closely related to its country, "Egypt", whereas the
event "wedding" co-occurs with the role "bride". A compound
concept may also relate to multiple visualsets, e.g., "London
graffiti". The bottom-ranked compound concepts are generally
combinations of general or abstract words. For instance, it is
difficult to capture a compact visual property for "pretty things"
or "surprise things".
4. CONCLUSION
To answer the question "how well can a concept be
represented in the real visual world?", we present in this paper a
novel visualness mining method to automatically discover and
measure visualizable concepts. The method consists of three
steps: 1) build an image heterogeneous graph with multiple
types of visual and textual features; 2) generate visualsets with a
simultaneous ranking and clustering algorithm; and 3) discover
visualizable simple and compound concepts via visualness
estimation based on the visualsets. Evaluated on a large-scale
benchmark dataset, our method achieves promising results.
5. REFERENCES
[1] K. Yanai and K. Barnard, “Image region entropy: a measure of
“visualness” of web images associated with one concept,” in
ACM Multimedia, 2005.
[2] Y. Lu, L. Zhang, Q. Tian, and W.Y. Ma, “What are the high-
level concepts with small semantic gaps?,” in CVPR, 2008.
[3] J.W. Jeong, X.J. Wang, and D.H. Lee, “Towards measuring the
visualness of a concept,” in CIKM, 2012.
[4] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei,
“Imagenet: A large-scale hierarchical image database,” in
CVPR, 2009.
[5] X.J. Wang, Z. Xu, L. Zhang, C. Liu, and Y. Rui, “Towards in-
dexing representative images on the web,” in ACM Multimedia
Brave New Idea Track, 2012.
[6] D. Tsai, Y. Jing, Y. Liu, H. A. Rowley, S. Ioffe, and J. M. Rehg,
“Large-scale image annotation using visual synset,” in ICCV,
2011.
[7] X. Zhu, A. B. Goldberg, M. Eldawy, C. R. Dyer, and B. Strock,
“A text-to-picture synthesis system for augmenting communi-
cation,” in AAAI, 2007.
[8] T. L. Berg, A. C. Berg, and J. Shih, “Automatic attribute dis-
covery and characterization from noisy web data,” in ECCV,
2010.
[9] A. Sun and S. S. Bhowmick, “Quantifying tag representative-
ness of visual content of social images,” in ACM Multimedia,
2010.
[10] Y. Sun, Y. Yu, and J. Han, “Ranking-based clustering of het-
erogeneous information networks with star network schema,”
in KDD, 2009.
[11] J. M. Kleinberg, “Authoritative sources in a hyperlinked envi-
ronment,” J. ACM, vol. 46, no. 5, pp. 604–632, 1999.
[12] T.S. Chua, J. Tang, R. Hong, H. Li, Z. Luo, and Y. Zheng,
"NUS-WIDE: a real-world web image database from National
University of Singapore," in CIVR, 2009.