
IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 23, NO. 9, SEPTEMBER 2014

A Probabilistic Associative Model for Segmenting Weakly Supervised Images

Luming Zhang, Member, IEEE, Yi Yang, Yue Gao, Senior Member, IEEE, Yi Yu, Changbo Wang, Member, IEEE, and Xuelong Li, Fellow, IEEE

Abstract—Weakly supervised image segmentation is an important yet challenging task in image processing and pattern recognition. It is defined as follows: in the training stage, semantic labels are given only at the image level, without regard to their specific object/scene locations within the image; given a test image, the goal is to predict the semantics of every pixel/superpixel. In this paper, we propose a new weakly supervised image segmentation model, focusing on learning the semantic associations between superpixel sets (termed graphlets in this paper). In particular, we first extract graphlets from each image, where a graphlet is a small-sized graph that measures the potential of multiple spatially neighboring superpixels (i.e., the probability of these superpixels sharing a common semantic label, such as the sky or the sea). To compare different-sized graphlets and to incorporate image-level labels, a manifold embedding algorithm is designed to transform all graphlets into equal-length feature vectors. Finally, we present a hierarchical Bayesian network to capture the semantic associations between post-embedding graphlets, based on which the semantics of each superpixel is inferred. Experimental results demonstrate that 1) our approach performs competitively with state-of-the-art approaches on three public data sets, and 2) considerable performance enhancement is achieved when our approach is applied to segmentation-based photo cropping and image categorization.

Index Terms—Probabilistic model, weakly-supervised, segmentation, associations.

I. INTRODUCTION

IMAGE segmentation is widely used in a variety of computer vision applications. For example, the photo cropping method proposed by Cheng et al. [6] first segments each photo and then probabilistically transfers the distribution of pairwise segmented regions from the training photos into the cropped photo. The image categorization model by Harchaoui et al. [14] measures the similarity between images by comparing their respective tree-structured segmented regions.

Manuscript received December 18, 2013; revised May 13, 2014; accepted July 3, 2014. Date of publication July 30, 2014; date of current version August 21, 2014. This work was supported in part by the National Natural Science Foundation of China under Grant 61125106 and in part by the Key Research Program of the Chinese Academy of Sciences under Grant KGZD-EW-T03. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Marios S. Pattichis. (Corresponding author: Y. Gao.)

L. Zhang and Y. Yu are with the School of Computing, National University of Singapore, Singapore 119077.

Y. Yang is with the School of Information Technology and Electrical Engineering, University of Queensland, Brisbane, QLD 4072, Australia.

Y. Gao is with the Tsinghua National Laboratory for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China (e-mail: [email protected]).

C. Wang is with East China Normal University, Shanghai 200241, China.

X. Li is with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), State Key Laboratory of Transient Optics and Photonics, Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an 710119, Shaanxi, P. R. China.

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier 10.1109/TIP.2014.2344433

Fig. 1. Exploiting the spatial structure and semantic associations between regions in weakly-supervised image segmentation.

With segmentation as an intermediate step, the performance of such applications is highly dependent on the segmentation results, and they typically assume that images are optimally segmented, i.e., that each segmented region covers exactly one semantic component. This, however, is nontrivial, and is made more difficult by the unsupervised segmentation processes commonly adopted in segmentation-based applications for their ease of use. Such segmentation methods are generally free of high-level cues and frequently generate regions that only partially cover one or more semantic components.

Recent progress in the image categorization and retrieval community [3], [11] has made image-level labels widely available, and in this work we propose to leverage image-level labels to improve image segmentation. In contrast to the pixel-level labels of fully-supervised segmentation, segmentation using image-level labels belongs to the weakly-supervised category. Broadly speaking, weakly-supervised segmentation is challenging for three reasons: 1) the intrinsic ambiguity of image-level labels: compared with pixel-level labels, which accurately describe object boundaries, image-level labels are much coarser cues and are more difficult to incorporate effectively into the segmentation process; 2) the lack of spatial structure in measuring the homogeneity of superpixels: beyond appearance features, the structure of a superpixel set is also essential for measuring its homogeneity. As shown in Fig. 1, the triangularly arranged superpixel structure is unique to the glass pyramid; these superpixels should therefore be assigned strong homogeneity and encouraged to merge; and 3) the lack of association modeling.


Concerning the last point, the semantics of a superpixel not only relies on its appearance, but is also constrained by the associations between spatially neighboring image regions. For example, for the building-covered regions shown in Fig. 1, the most probable neighboring regions are sky-covered rather than water-covered, although sky-covered and water-covered regions have similar appearance. Compared with the association "a building-covered region corresponds to a water-covered neighboring region", the association "a building-covered region comes with a neighboring sky-covered region" is much more semantically reasonable.

To address the aforementioned challenges, we propose a new weakly-supervised image segmentation method. Our approach learns the semantic associations between spatially structured superpixel sets from images with image-level labels, and then applies the learned associations to guide the semantic labeling of each superpixel. We first construct graphlets to capture the appearance and structure of superpixels, where graphlets are small-sized connected graphs that can be considered a natural extension of the previous spatial-structure-free homogeneity of pairwise [37] or multiple [23] superpixels. Since graphlets of different sizes are incomparable in Euclidean space, we project graphlets onto the Grassmann manifold [2], [18], [26] and propose a patch-alignment-based manifold embedding algorithm to encode 1) image-level labels, 2) image global spatial layout, 3) geometric context, and 4) multi-channel appearance features into graphlets. After the embedding process, graphlets are converted into equal-length feature vectors. To capture the semantic associations between spatially neighboring graphlets, we construct a hierarchical Bayesian network (BN). Based on the BN, graphlet semantics are inferred via an appearance-based and association-enforced probabilistic model, while superpixel semantics are computed by maximum majority voting over their parent graphlets.1

II. RELATED WORK

Recently, several weakly-supervised image segmentation methods have been proposed, which focus on statistical models that transfer image-level labels into superpixel unary or pairwise potentials. Verbeek et al. [37] proposed an aspect model to estimate pixel-level labels for each image, where each image is modeled as a mixture of latent topics. Vezhnevets et al. [35] presented a multiple instance learning (MIL) [5] framework for segmenting weakly-supervised images: each superpixel is represented as an instance, each image is denoted as a bag of instances, and only the bag label is known, so image segmentation is formulated as instance label inference. One limitation of [35] and [37] is the lack of label smoothing between pairwise superpixels. To solve this problem, Vezhnevets et al. [36] proposed a graphical model, termed the multi-image model (MIM), which integrates image appearance features, image-level labels, and superpixel labels into one network. Later, the same authors [39] extended the MIM-based segmentation framework to further boost its performance.

1 Parent graphlets denote graphlets whose constituent superpixels are supersets of the pairwise superpixels.

First, the superpixels are semantically labeled using MIM. Next, to refine the semantic labels, an active learning [17] mechanism selects the few most semantically uncertain superpixels within an image and accurately labels them by querying an oracle database. Finally, the labels of the remaining superpixels are inferred through the pairwise potentials of a conditional random field [21]. The oracle database, however, is usually large-scale, so querying it for the semantically uncertain superpixels requires a large amount of computation, which limits the applicability of [39]. Noticeably, the single-channel visual feature used in [35] has limited descriptive power. Therefore, Vezhnevets et al. [40] further developed a parametric family of structured models, where multi-channel visual features form the pairwise potential and the weight of each channel is decided by minimizing the disagreement between superpixels labeled by differently trained segmentation models. All these algorithms yield improved segmentation performance in a weakly-supervised setting, but are limited by the low descriptiveness of their unary or pairwise potentials. For example, a considerable number of ambiguous segment boundaries are observed; furthermore, the semantic associations between regions are not considered explicitly, which may lead to perceptually unreasonable segmented regions.

To incorporate more descriptive superpixel potentials, Kohli et al. [16] proposed high-order conditional random fields for image segmentation, where the high-order potentials are defined over multiple pixels. In [23], Rital generalized the conventional normalized cut into the hypergraph cut, where each hyperedge connects multiple spatially neighboring superpixels. However, the hypergraph cut has two weaknesses: 1) supervision is difficult to incorporate, and 2) label inference is inefficient. To overcome these limitations, Kim et al. [15] developed supervised high-order correlation clustering for image segmentation; based on the structured support vector machine and a linear programming relaxation, both the parameter learning and the segmentation processes are carried out efficiently. Compared with pairwise potentials, segmentation using high-order potentials successfully avoids local boundary ambiguities. Nevertheless, the spatial structure of superpixels, an important cue for measuring their homogeneity, is again not considered. Besides, these approaches are either unsupervised or fully-supervised, and it is difficult to transform them into weakly-supervised versions. In [27] and [38], Zhang et al. proposed a weakly-supervised segmentation algorithm that learns the distribution of spatially structured superpixel sets, i.e., graphlets, from image-level labels. However, this approach has two drawbacks: 1) multi-channel visual features such as color and texture contribute differently [1] to the segmentation process, but they are assigned the same weight by Zhang et al.; and 2) the semantic associations between segmented regions are not considered, which potentially results in unreasonable segmentation. In this work, a patch-alignment-based embedding algorithm and a Bayesian network are proposed to tackle these two problems, respectively. Experimental results clearly demonstrate the advantage of our proposed method over that of Zhang et al. [27].


III. SPATIAL STRUCTURE-EXPLOITED HOMOGENEITY

In this work, we use simple linear iterative clustering (SLIC) [4] to generate superpixels. Compared with several other well-known pixel clustering algorithms, such as k-means [8] and normalized cut [24], SLIC is significantly faster, consumes much less memory, and produces superpixels that adhere better to object boundaries. As mentioned, the spatial arrangement of superpixels is an important cue that should be exploited when measuring their homogeneity. We therefore use graphlets [28] to capture the homogeneity of a spatially structured superpixel set; a graphlet is a small-sized connected graph defined as:

$$G = (V, E) \qquad (1)$$

where V is a set of vertices, each representing a superpixel, and E is a set of edges, each connecting spatially adjacent superpixels in V. The graphlet size denotes the number of its constituent superpixels.

We use small-sized graphlets in this work because: 1) given an image, the number of its graphlets increases exponentially with the graphlet size; 2) the graphlet embedding implicitly broadens the homogeneity of the small-sized graphlets (as discussed in Section IV); and 3) empirical results show that the segmentation accuracy stops increasing when the graphlet size grows from 5 to 10, suggesting that small-sized graphlets are descriptive enough. Letting T denote the maximum graphlet size, we extract graphlets of all sizes from 2 to T. The graphlet extraction is based on depth-first search, which is computationally efficient: the computation time increases linearly with the number of superpixels plus the number of spatially neighboring superpixel pairs. The proposed method is also storage-efficient: for an image with 50 superpixels, assuming the average superpixel degree2 is 5 and the maximum graphlet size is also 5, there are $50 \cdot 5^5/5! + \cdots + 50 \cdot 5^2/2! \approx 4300$ graphlets, which, after embedding, are transformed into 4300 low-dimensional feature vectors; the required storage space is thus rather small.
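To make the extraction step concrete, the Python sketch below enumerates all connected superpixel subsets of size 2 to T by depth-first growth over an adjacency map. The input representation and the set-based deduplication are illustrative assumptions, not the authors' implementation.

```python
def extract_graphlets(adjacency, T):
    """Enumerate connected superpixel subsets (graphlet vertex sets) of size 2..T.

    adjacency: dict mapping a superpixel id to the set of ids spatially
    adjacent to it (assumed symmetric). Returns a set of frozensets.
    """
    graphlets = set()

    def grow(current, frontier):
        if len(current) >= 2:
            graphlets.add(frozenset(current))
        if len(current) == T:
            return
        for v in list(frontier):
            # Depth-first growth: absorb one spatially adjacent superpixel.
            grow(current | {v}, (frontier | adjacency[v]) - current - {v})

    for seed in adjacency:
        grow({seed}, set(adjacency[seed]))
    return graphlets
```

The set-based deduplication keeps the sketch short at the cost of revisiting subsets; a production implementation would use the linear-time bookkeeping described above.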

In the image segmentation community, the homogeneity of superpixels is a fundamental concept that reflects the probability of pairwise or multiple superpixels sharing a common semantic label. It is important to note that, as the homogeneity of a spatially structured superpixel set, graphlet-based homogeneity naturally extends the previous non-structural superpixel set homogeneity [15], [16]. In particular, both the pairwise and the high-order potentials represent the homogeneity of orderless superpixels, whereas a graphlet represents the homogeneity of spatially structured superpixels. If we ignore the topology encoded in graphlets, the proposed graphlet-based homogeneity reduces to the high-order potential homogeneity.

Both the appearance of each superpixel and the spatial structure of the set are important for a descriptive homogeneity measure. As shown in Fig. 2, the superpixel appearances in graphlets G3 and G4 are similar, yet their spatial structures are quite different. Compared with the triangular structure of graphlet G3, which is unique to the glass pyramid, the linear structure of graphlet G4 is common to a variety of semantic objects (e.g., the road in graphlet G5); thus superpixels in graphlet G4 are assigned weaker homogeneity than those in G3. On the other hand, the appearances of superpixels in graphlets G1 and G2 differ significantly while their spatial structures are similar. The fallow superpixels in graphlet G1 are unique to buildings, while the blue superpixels in graphlet G2 appear commonly across objects/scenes (e.g., sky, water); thus superpixels in graphlet G1 are assigned stronger homogeneity than those in G2.

2 The degree of a superpixel is the number of superpixels spatially connected to it.

Fig. 2. Superpixels and their spatial structure both contribute to the graphlet-based homogeneity.

The above observation reveals two points. First, the homogeneity of a graphlet's constituent superpixels is determined by how closely the graphlet correlates with one type of semantics.3

Second, accurately revealing the semantics of a graphlet relies on effectively characterizing both the graphlet's structure and its constituent superpixels' appearance. To represent the structure of a t-sized graphlet, we use a t × t-sized matrix defined as:

$$M_S(i, j) = \begin{cases} \theta(R_i, R_j) & \text{if } R_i \text{ and } R_j \text{ are spatially adjacent} \\ 0 & \text{otherwise} \end{cases} \qquad (2)$$

where $\theta(R_i, R_j)$ is the angle between the horizontal axis and the vector from the center of superpixel $R_i$ to the center of superpixel $R_j$. To represent the appearance of the constituent superpixels of a graphlet, both color and texture features are used. Specifically, for a t-sized graphlet, we denote the texture channel of its constituent superpixels $R$ as a matrix $M^T_R$, where each row of $M^T_R$ is a 128-dimensional histogram of oriented gradients (HOG) [9]. Similarly, the color channel is represented as $M^C_R$, where each row of $M^C_R$ is a 9-dimensional color moment [29]. Based on the three matrices $M^T_R$, $M^C_R$, and $M_S$, we represent a graphlet in its texture channel by the $t \times (128 + t)$-sized matrix $M^T = [M^T_R, M_S]$ and in its color channel by the $t \times (9 + t)$-sized matrix $M^C = [M^C_R, M_S]$. To measure the similarity of two identical-sized graphlets in either the texture or the color channel, inspired by differential geometry, the corresponding matrices are deemed points on the Grassmann manifold [25], [41], and the Golub-Werman distance is defined as:

$$d_{GW}(M, M') = \|M_o - M'_o\|_{\ell_2} \qquad (3)$$

where $M$ and $M'$ are two identical-sized matrices, and $M_o$ and $M'_o$ are their orthonormal bases, respectively.4

3 This property is reflected by the proposed hierarchical BN, where the semantics of a superpixel is determined by maximum majority voting over those of its parent graphlets (see Section V).

4 The orthonormal bases are calculated with the Matlab script orth(M).
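For illustration, the sketch below instantiates Eqs. (2) and (3): the structure matrix is filled with center-to-center angles, and the Golub-Werman distance is computed from orthonormal bases via scipy.linalg.orth, mirroring the Matlab orth() of footnote 4. The centers/adjacency inputs are assumed representations, and equal ranks of the two matrices are assumed.

```python
import numpy as np
from scipy.linalg import orth

def structure_matrix(centers, adjacency):
    """Spatial-structure matrix M_S of Eq. (2) for a t-sized graphlet.

    centers: (t, 2) array of superpixel center coordinates (x, y).
    adjacency: (t, t) boolean matrix of spatial adjacency within the graphlet.
    """
    t = len(centers)
    MS = np.zeros((t, t))
    for i in range(t):
        for j in range(t):
            if i != j and adjacency[i, j]:
                dx, dy = centers[j] - centers[i]
                MS[i, j] = np.arctan2(dy, dx)  # angle w.r.t. the horizontal axis
    return MS

def golub_werman(M, M_prime):
    """Golub-Werman distance of Eq. (3) between two identical-sized graphlet
    matrices viewed as points on the Grassmann manifold."""
    return float(np.linalg.norm(orth(M) - orth(M_prime)))
```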

Fig. 3. Left: four graphlets (each denoted as a matrix) on the Grassmann manifold; right: four post-embedding graphlets in the Euclidean space.

Fig. 4. Adding geometric context into graphlets: ground (green), differently oriented vertical regions (red), and non-planar solid ('x').

IV. PATCH-ALIGNMENT-BASED MANIFOLD GRAPHLET EMBEDDING

As mentioned above, semantically consistent superpixels within a graphlet reflect strong homogeneity, which suggests integrating category information into graphlets when measuring the homogeneity of superpixels. To do this, inspired by recent progress in manifold embedding [2], we deem graphlets points on the Grassmann manifold and propose a manifold embedding algorithm to encode image-level labels into graphlets.

In addition to image-level labels, three supplementary cues (i.e., global spatial layout, geometric context, and multi-channel appearance features) are also incorporated into the embedding. The contribution of each term is elaborated below. First, the global spatial layout information is useful for maximally preserving the relative distances between graphlets in the embedding scheme, which helps expand the homogeneity of superpixels across individual graphlets. As shown on the left of Fig. 3, preserving the relative distances between graphlets implicitly encodes the global spatial layout of each image, which implicitly broadens the homogeneity of the individual small-sized graphlets. Second, as demonstrated by Vezhnevets et al. [35], geometric context [13] effectively complements image-level labels for image segmentation. Specifically, geometric context refers to the categorization of each pixel into ground, differently oriented vertical regions, non-planar solid, or porous. Intuitively, a graphlet with consistent geometric context should reflect stronger homogeneity. As shown in Fig. 4, graphlet G1 has more consistent geometric context than graphlet G2; thus superpixels within G1 should be assigned stronger homogeneity than those within G2. Third, as demonstrated by many image segmentation works [1], multiple appearance features contribute to image segmentation collaboratively, so we integrate them into the embedding process.

To capture the four cues, namely image-level labels, global spatial layout, geometric context, and multi-channel appearance features, we propose a graphlet embedding algorithm that contains two parts: patch optimization and global optimization. Patch optimization obtains the low-dimensional representation of the graphlets within each image in the k-th channel. Formally, to integrate the global spatial layout into graphlets, we preserve all pairwise distances between graphlets in the embedding process, as shown in Fig. 3. The objective function is defined as:

$$\arg\min_{Y(k)} \sum_{ij} \big[ d_{GW}(M_i(k), M_j(k)) - d_E(y_i(k), y_j(k)) \big]^2 \cdot \phi_i \phi_j \qquad (4)$$

where $M_i(k)$ is the matrix corresponding to the $i$-th graphlet in the $k$-th channel, $y_i(k)$ is its low-dimensional representation, and $d_E(\cdot, \cdot)$ is the Euclidean distance. The term $\sum_{ij} [d_{GW}(M_i(k), M_j(k)) - d_E(y_i(k), y_j(k))]^2$ accumulates the discrepancy between the pairwise graphlet distances on the Grassmann manifold and those in the Euclidean space. $\phi_i = -\sum_j g_i(j) \log_2 g_i(j)$ is the entropy reflecting the geometric context purity of the $i$-th graphlet, where $g_i(j)$ is the proportion of the $j$-th geometric context in graphlet $G_i$; a smaller $\phi_i$ indicates a more consistent geometric context of graphlet $G_i$.
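As a small illustration, the purity weight $\phi_i$ is just the Shannon entropy of the geometric-context proportions $g_i(j)$; the proportions input is an assumed representation.

```python
import numpy as np

def geometric_context_purity(proportions):
    """Entropy phi_i over the geometric-context proportions g_i(j) of a
    graphlet; smaller values indicate more consistent geometric context."""
    g = np.asarray(proportions, dtype=float)
    g = g[g > 0]                      # treat 0 * log2(0) as 0
    return float(-(g * np.log2(g)).sum())
```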

Denote $D^h_{GW} = [d_{GW}(M^h_i, M^h_j)]$ the matrix whose entry $d_{GW}(M^h_i, M^h_j)$ is the Golub-Werman distance between the $i$-th and $j$-th identical-sized graphlets extracted from the $h$-th image. Its inner product matrix is obtained by:

$$\tau(D^h_{GW}) = -R_{N_h} S^h_{GW} R_{N_h} / 2 \qquad (5)$$

where $(S^h_{GW})_{ij} = (D^h_{GW})^2_{ij}$, and $R_{N_h} = I_{N_h} - \vec{e}_{N_h} \vec{e}^T_{N_h} / N_h$ is the centralization matrix; $I_{N_h}$ is an $N_h \times N_h$ identity matrix, $\vec{e}_{N_h} = [1, 1, \ldots, 1]^T \in \mathbb{R}^{N_h}$, $N$ is the number of all training graphlets, and $N_h$ the number of graphlets from the $h$-th training image. Thus the first part of (4) can be rewritten as:

$$\arg\min_{Y(k)} \sum_{ij} \|\tau(D_{GW}(k)) - \tau(D_Y(k))\|^2 \cdot \phi_i \phi_j = \arg\max_{Y(k)} \operatorname{tr}\big( Y(k)\, \tau(D_{GW}(k))\, \Phi\, Y(k)^T \big) \qquad (6)$$

where $\Phi$ is an $N \times N$ matrix whose $ij$-th element is $\phi_i \phi_j$.

To integrate image-level labels into graphlets, another objective function is defined as:

$$\arg\min_{Y(k)} \Big[ \sum_{ij} \|y_i(k) - y_j(k)\|^2\, l^{ij}_s - \sum_{ij} \|y_i(k) - y_j(k)\|^2\, l^{ij}_d \Big] \cdot \phi_i \phi_j \qquad (7)$$

where $Y(k) = [y_1(k), y_2(k), \ldots, y_N(k)]$; $l^{ij}_s$ and $l^{ij}_d$ respectively represent the image-level categorical similarity and dissimilarity between graphlets, computed as follows. Denoting $c(G)$ the $C$-dimensional row vector containing the class label of the image corresponding to graphlet $G$, and $\vec{N} = [N_1, N_2, \ldots, N_C]^T$ the $C$-dimensional column vector containing the number of images per category, we define

$$l^{ij}_s = \frac{[c(M_i) \cap c(M_j)]\, \vec{N}}{\sum_c N_c} \quad \text{and} \quad l^{ij}_d = \frac{[c(M_i) \oplus c(M_j)]\, \vec{N}}{\sum_c N_c}.$$

The numerator of $l_s$ is the number of photos in categories shared by the photos from which the two graphlets are extracted; the numerator of $l_d$ is the number of photos in categories where those photos differ; and the denominator is the total number of photos over all categories.
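A minimal sketch of the $l_s$ and $l_d$ terms, assuming each image's categories are given as a binary C-dimensional indicator vector, so that the bracketed operators become element-wise AND and XOR:

```python
import numpy as np

def label_similarity(ci, cj, n_per_class):
    """l_s and l_d between two graphlets, from the binary label indicator
    vectors ci, cj of their source images and the per-category image counts."""
    n = np.asarray(n_per_class, dtype=float)
    shared = np.logical_and(ci, cj)      # categories the two images share
    differ = np.logical_xor(ci, cj)      # categories in which they differ
    l_s = float(shared @ n / n.sum())
    l_d = float(differ @ n / n.sum())
    return l_s, l_d
```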

After some derivation, (7) can be reorganized into:

$$\arg\max_Y \operatorname{tr}\big( Y A(k) Y^T \big) \qquad (8)$$

wherein $A(k) = \sum_{i=1}^{N} [I^{1:i-1}_{N-1}, -\vec{e}^T_{N-1}, I^{i:N-1}_{N-1}]^T\, W_i(k)\, [I^{1:i-1}_{N-1}, -\vec{e}^T_{N-1}, I^{i:N-1}_{N-1}]$, and $W_i(k)$ is an $N \times N$ diagonal matrix whose $h$-th diagonal element is $[l^{hi}_s(k) - l^{hi}_d(k)] \cdot \phi_h \phi_i$.

By accumulating the patch optimizations (6) and (8) from all images and from all visual channels, we obtain:

$$\arg\min_{Y,\alpha} \sum_{j=1}^{N} \sum_{k=1}^{K} \alpha_k \operatorname{tr}\big( Y S^j \tau(D^j_{GW}(k))\, \Phi\, (S^j)^T Y^T \big) + \alpha_k \operatorname{tr}\big( Y A(k) Y^T \big) = \arg\min_{Y,\alpha} \sum_{k=1}^{K} \alpha_k \operatorname{tr}\big( Y Z(k) Y^T \big)$$
$$\text{s.t. } YY^T = I, \quad \sum_{k=1}^{K} \alpha_k = 1, \quad \alpha_k \ge 0 \qquad (9)$$

where $S^j$ is an $N \times N_j$-sized diagonal selection matrix: of all the $N$ graphlets, if the $a$-th graphlet comes from the $j$-th image, then the $a$-th diagonal element is set to 1, and to 0 otherwise; $Z(k) = \sum_{j=1}^{N} S^j \tau(D_{GW})(S^j)^T + A(k)$.

To solve (9), we rewrite it as:

$$\arg\min_{Y,\alpha} \sum_{k=1}^{K} \alpha^r_k \operatorname{tr}\big( Y Z(k) Y^T \big) \quad \text{s.t. } YY^T = I, \quad \sum_{k=1}^{K} \alpha_k = 1, \quad \alpha_k \ge 0 \qquad (10)$$

where $r > 1$ is a parameter denoting the correlation of the multiple channels; a large $r$ reflects a high correlation. In our implementation we set $r = 2$, and segmentation performance under different values of $r$ is reported in the experiment section (Section VI).

The above objective function can be solved via alternating optimization. Using a Lagrange multiplier to take the constraint $\sum_{k=1}^{K} \alpha_k = 1$ into consideration, we obtain the Lagrange function:

$$L(\alpha, \lambda) = \sum_{k=1}^{K} \alpha^r_k \operatorname{tr}\big( Y Z(k) Y^T \big) - \lambda \Big( \sum_{k=1}^{K} \alpha_k - 1 \Big) \qquad (11)$$

Setting the derivatives of $L(\alpha, \lambda)$ with respect to $\alpha_k$ and $\lambda$ to zero yields

$$\alpha_k = \frac{\big( 1 / \operatorname{tr}(Y Z(k) Y^T) \big)^{1/(r-1)}}{\sum_{k=1}^{K} \big( 1 / \operatorname{tr}(Y Z(k) Y^T) \big)^{1/(r-1)}} \qquad (12)$$

With $\alpha_k$ obtained, we update $Y$; the optimization problem is equivalent to:

$$\min_Y \operatorname{tr}\big( Y Z Y^T \big) \quad \text{s.t. } YY^T = I \qquad (13)$$

where $Z = \sum_{k=1}^{K} \alpha^r_k Z(k)$. Equation (13) is a quadratic programming problem with a quadratic constraint that can be solved by eigenvalue decomposition, which has a time complexity of $O(N^3)$. However, $Z$ is a large matrix because usually $N > 50{,}000$, so solving (13) with a global once-for-all eigenvalue decomposition of $Z$ is computationally intractable. Instead, we decompose the eigenvalue decomposition into a set of subproblems. First, we solve an initial embedding $Y^{(0)}$ using (13) on $N^{(0)}$ training graphlets, where $N^{(0)} \ll N$. Then we use coordinate propagation to transmit the embedded coordinates to the new graphlets; the propagation is computationally efficient, based on the iterative algorithm proposed by Xiang et al. [42]. Note that, given a maximum graphlet size of, for example, 10, the graphlet embedding is carried out 10 times: in the $t$-th run, all $t$-sized graphlets are embedded, and different-sized graphlets are thereby transformed into feature vectors of the same dimensionality.

Algorithm 1: Patch-Alignment-Based Graphlet Embedding

The procedure of the proposed graphlet embedding algorithm is summarized in Algorithm 1. Noticeably, the alternating optimization is guaranteed to converge, since the objective function (10) decreases continuously during the iterations.
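The alternating optimization of Eqs. (10)-(13) can be sketched as follows. This is a small-N illustration under the assumptions that each Z(k) is symmetric and each trace tr(Y Z(k) Yᵀ) is positive; the coordinate-propagation speed-up of [42] is omitted.

```python
import numpy as np

def alternating_embedding(Z_list, d, r=2.0, n_iter=20):
    """Alternate between the closed-form channel weights of Eq. (12) and the
    eigendecomposition-based embedding update of Eq. (13)."""
    K = len(Z_list)
    alpha = np.full(K, 1.0 / K)          # uniform initial channel weights
    for _ in range(n_iter):
        # Y-step (Eq. 13): minimize tr(Y Z Y^T) s.t. Y Y^T = I, where
        # Z = sum_k alpha_k^r Z(k); the minimizer is the d smallest eigenvectors.
        Z = sum((a ** r) * Zk for a, Zk in zip(alpha, Z_list))
        _, eigvecs = np.linalg.eigh(Z)   # eigenvalues in ascending order
        Y = eigvecs[:, :d].T             # d x N embedding, rows orthonormal

        # alpha-step (Eq. 12): closed form from the Lagrangian of Eq. (11).
        traces = np.array([np.trace(Y @ Zk @ Y.T) for Zk in Z_list])
        w = (1.0 / traces) ** (1.0 / (r - 1.0))
        alpha = w / w.sum()
    return Y, alpha
```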

V. AN ASSOCIATIVE BAYESIAN NETWORK FOR SUPERPIXEL SEMANTICS INFERENCE

The post-embedding graphlets are representative owing to the multi-cue integration. However, the semantic associations between graphlets, the key factor for achieving semantically reasonable segmentation results, are still not considered. For example, for a boat-covered graphlet, the most probable neighboring graphlets are water-covered rather than sky-covered, although water- and sky-covered graphlets have similar appearance. To integrate graphlet semantic associations into the segmentation model, a hierarchical BN is proposed.

As shown at the top of Fig. 5, the hierarchical BN contains three layers, denoting post-embedding graphlets, graphlet semantics, and superpixel semantics, respectively. The one-to-one connections from nodes in the first layer to those in the second layer represent appearance-based graphlet semantics inference. The many-to-one connections from nodes in the second layer to those in the third layer illustrate that, if multiple graphlets share a superpixel, they collaboratively determine the semantics of the shared superpixel.

Fig. 5. The hierarchical BN for semantics inference (top), where the semantic associations between graphlets (dotted box) are detailed in the hierarchy (bottom).

As shown at the bottom of Fig. 5, the semantic associations between graphlets are modeled by a hierarchy constructed as follows. First, we select the P1 most semantically discriminative graphlets, i.e., those whose post-embedding graphlets are most confidently categorized by a semantics classifier, and treat them as the seed graphlets forming the first layer. To build the second layer, we collect graphlets that are spatial neighbors of those in the first layer; these graphlets are termed children with respect to their spatially neighboring parent graphlets in the first layer. Because a parent and its child are semantically correlated, we connect them through a directed edge. The subsequent layers are constructed similarly until all graphlets are added to the hierarchy. Noticeably, the semantics classifier is a multi-class linear SVM trained as follows: we treat each training image as a 1-sized graphlet and compute its post-embedding graphlet; these post-embedding graphlets, associated with their image-level labels, are used to train the linear multi-class SVM.

After constructing the hierarchical BN, the semantics ofeach graphlet (from the second layer downwards) is inferredbased on both its appearance (i.e., post-embedding graphlet)and the associative enforcement from the parent graphlet.Particularly, semantics of a graphlet from the second layeris computed as:

arg maxs(G21),s(G2

2),...,s(G2P2

) p(s(G21), s(G2

2), . . . , s(G2P2

))︸ ︷︷ ︸appearance−based sema. in f er.

·

p(s(G21), s(G2

2), . . . , s(G2P2

)|s(G11), s(G1

2), . . . , s(G1P1

))︸ ︷︷ ︸associat ion−en f orce. sema. in f er.

(14)

The first and the second term in (14) are detailed as follows:

p(s(G21), s(G2

2), . . . , s(G2P2

)) =∏P2

i=1p(s(G2

i )) (15)


where $p(s(G^2_i))$ is the probabilistic output of the semantics SVM classifier [14], i.e.,

$$p\big(s(G^2_i)\big) = \frac{1}{1 + \exp\big(-\omega(y(G^2_i))\big)},$$

where $\omega(y(G^2_i))$ is the linear function of the SVM and $y(G^2_i)$ the post-embedding graphlet of $G^2_i$.
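As a minimal illustration of the sigmoid-calibrated SVM output above (the bias term b is an assumption here, since only the linear function ω(·) is specified):

```python
import numpy as np

def graphlet_semantics_prob(w, b, y_embedded):
    """p(s(G)) = 1 / (1 + exp(-omega(y(G)))) with a linear score
    omega(y) = w . y + b computed on the post-embedding graphlet."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, y_embedded) + b)))
```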

The second term of (14) is computed as:

$$p\big(s(G^2_1), \ldots, s(G^2_{P_2}) \,\big|\, s(G^1_1), \ldots, s(G^1_{P_1})\big) = \prod_{G \in \varphi(G^2_1)} p\big(s(G^2_1) \mid s(G)\big) \cdots \prod_{G \in \varphi(G^2_{P_2})} p\big(s(G^2_{P_2}) \mid s(G)\big) \qquad (16)$$

where $\varphi(\cdot)$ collects the parents of a graphlet, and $p(s(G^2_i) \mid s(G))$ is the probability of graphlet $G^2_i$ having semantics $s(G^2_i)$, conditioned on its parent $G$ having semantics $s(G)$. In this work, $p(s(G^2_i) \mid s(G))$ is defined by a 2-dimensional Gaussian mixture model (GMM), i.e.,

$$p\big(s(G^2_i) \mid s(G)\big) = \sum_{q=1}^{Q} w_q\, \mathcal{N}\big(s(G^2_i) \cup s(G) \mid \mu_q, \Sigma_q\big) \qquad (17)$$

where $\mathcal{N}(s(G^2_i) \cup s(G) \mid \mu_q, \Sigma_q)$ is the $q$-th Gaussian component, i.e., $\mathcal{N}(x \mid \mu_q, \Sigma_q) = \frac{1}{2\pi |\Sigma_q|^{1/2}} \exp\big(-\frac{1}{2}(x - \mu_q)^T \Sigma_q^{-1} (x - \mu_q)\big)$ with $x = s(G^2_i) \cup s(G)$; $Q$ is the number of Gaussian components, and the parameters $\{w_q, \mu_q, \Sigma_q\}$ are determined using expectation maximization.
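The conditional of Eq. (17) can be sketched with an off-the-shelf EM implementation; here scikit-learn's GaussianMixture stands in for the EM estimation of {w_q, μ_q, Σ_q}, and the fitted mixture density is evaluated at the concatenated (child, parent) semantics pair, mirroring s(G²ᵢ) ∪ s(G). The pair encoding and Q = 5 are assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_association_gmm(child_parent_pairs, Q=5):
    """Fit the 2-D GMM of Eq. (17) on (child semantics, parent semantics)
    pairs collected from the training hierarchy; EM runs inside fit()."""
    X = np.asarray(child_parent_pairs, dtype=float)   # shape (n_pairs, 2)
    return GaussianMixture(n_components=Q, covariance_type="full").fit(X)

def association_density(gmm, s_child, s_parent):
    """Mixture density at the concatenated semantics pair s(G_i^2) U s(G)."""
    return float(np.exp(gmm.score_samples([[s_child, s_parent]])[0]))
```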

After computing the semantics of each graphlet, we probabilistically aggregate them into the superpixel semantics via maximum majority voting, i.e.,

$$p(s(R)) = \arg\max_c \prod_{G \supset R} p\big(s(G) \rightarrow c\big) \cdot p(|G|) \qquad (18)$$

where $G \supset R$ ranges over all graphlets containing superpixel $R$, $p(s(G) \rightarrow c)$ denotes the probability of graphlet $G$ belonging to the $c$-th category, and $p(|G|) \propto \exp(-1/|G|)$ encodes the weight of graphlets of different sizes: larger graphlets predict semantics more confidently and are hence assigned larger weights.
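A minimal sketch of the maximum majority voting of Eq. (18), assuming each graphlet contributes its per-class probabilities and size; the product is evaluated in log space, and the small floor constant is an implementation assumption for numerical stability.

```python
import numpy as np

def superpixel_label(graphlets_of_R, num_classes):
    """Aggregate graphlet semantics into a superpixel label (Eq. 18).

    graphlets_of_R: iterable of (class_probs, size) for every graphlet G
    containing superpixel R, with class_probs[c] = p(s(G) -> c), size = |G|.
    """
    log_score = np.zeros(num_classes)
    for class_probs, size in graphlets_of_R:
        weight = np.exp(-1.0 / size)     # p(|G|): larger graphlets weigh more
        log_score += np.log(weight * np.asarray(class_probs) + 1e-12)
    return int(np.argmax(log_score))
```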


TABLE I: DETAILS OF THE THREE DATA SETS

Summarizing the discussion from Section III to Section V, Algorithm 2 outlines the procedure of the proposed graphlet-guided weakly-supervised segmentation algorithm.

Algorithm 2: The Graphlet-Guided Bayesian Network for Weakly-Supervised Segmentation

VI. EXPERIMENTAL RESULTS AND ANALYSIS

A. Comparison With the State-of-the-Art

This experiment compares the proposed approach with seven other segmentation methods: four weakly-supervised and three fully-supervised.5 The four weakly-supervised methods are 1) multiple-instance-learning (MIL)-based weakly-supervised segmentation [35], which casts weakly-supervised image segmentation as multiple instance learning [5]; 2) the multi-image model (MIM) [36], which formulates image segmentation as superpixel semantics inference on a graphical model; 3) structural output learning (SOL) [40], which applies multi-channel visual features to form the pairwise potential for calculating superpixel semantics; and 4) probabilistic graphlet cut (PGC) [27], which uses the distribution of graphlets learned from image-level semantics to improve the conventional normalized cut. The three fully-supervised methods are 5) TextonBoost (TB) [30], a discriminative model for segmenting semantically meaningful objects in an image; 6) the hierarchical conditional random field (HCRF) [20], which integrates different types of visual features into the segmentation model; and 7) constrained parametric min-cuts (CPMC) [7], which merges superpixels based on a combination of bottom-up low-level and mid-level cues.

To quantitatively evaluate the aforementioned methods, three data sets are used: the SIFT-flow [19], the MSRC-21 [30], and the VOC 2009 [10]. The first two are suitable for weakly-supervised segmentation experiments: unlike many segmentation data sets that provide only pixel-level labels of foreground objects, such as the PASCAL VOC series, the SIFT-flow and the MSRC-21 provide both image-level labels (for foreground and background objects) and pixel-level labels. The third data set (the VOC 2009) is popularly used in fully-supervised segmentation experiments, and the comparative results show the competitiveness of our approach with its fully-supervised counterparts. Note that only foreground objects are pixel-level labeled in images from the VOC 2009, and we use them as foreground image-level labels. To obtain the background image-level labels, we manually label the background of each image as one of the {sky, road, indoor} categories, and then combine the foreground and background image-level labels into the overall image-level labels. Details of the three data sets are given in Table I.

5 Strictly, it is unfair to compare the proposed weakly-supervised segmentation algorithm with fully-supervised ones, as they require different inputs. In our experiment, we compare them to show the performance gap between fully- and weakly-supervised image segmentation.

TABLE II: QUANTITATIVE COMPARISONS WITH SEVEN OTHER SEGMENTATION METHODS USING THE AVERAGE PER-CLASS MEASURE

The segmentation performance is evaluated by the average per-class measure, which averages the fraction of correctly classified pixels per class over all classes. One advantage of this measure is that each semantic category, such as sky or car, contributes equally regardless of its number of constituent pixels.
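For concreteness, the average per-class measure can be computed as below, assuming pred and gt are integer label maps of equal shape:

```python
import numpy as np

def average_per_class_accuracy(pred, gt, num_classes):
    """Average, over all classes present in gt, of the fraction of that
    class's pixels labeled correctly; sky and car thus count equally."""
    per_class = []
    for c in range(num_classes):
        mask = (gt == c)
        if mask.any():
            per_class.append(float((pred[mask] == c).mean()))
    return float(np.mean(per_class))
```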

In Table II, we report the comparative segmentation performance on the three data sets, and the following observations are made.

• Our approach significantly outperforms the four weakly-supervised segmentation methods, which demonstrates that image-level labels are more effectively encoded by our approach.

• The proposed algorithm performs competitively with the three well-known fully-supervised segmentation methods. This suggests that, although image-level labels are much coarser than pixel-level labels, the performance gap between the two is minimal if the image-level labels are appropriately exploited.

• The proposed approach beats PGC consistently on the three data sets, demonstrating the effectiveness of the two improvements over PGC [27], i.e., the dynamic multi-channel graphlet weighting mechanism and the association-enforced region semantics inference framework.

• Compared with CPMC, the best performer on the VOC 2009 and 2010, our approach does not perform as well. This is largely because the superpixel combination-ranking mechanism in CPMC captures more accurate high-level cues than the BN used in our approach. Note, however, that this does not mean CPMC is more powerful at high-level cue integration; rather, the pixel-level labels used in CPMC capture high-level cues more accurately than the image-level labels used in our method. Two weaknesses of CPMC are that 1) pixel-level labels are much more labor-intensive to acquire, and 2) it takes considerably more computation time and memory in both its training and test stages, due to the enumerative superpixel combination operation.

B. Applications to Segmentation-Based Tasks

Beyond semantically segmenting an image, the proposed approach can be used in a variety of segmentation-based tasks. In this experiment, we show the performance enhancement of segmentation-based image categorization and photo cropping obtained by replacing their unsupervised segmentation modules with the proposed weakly-supervised one.

TABLE III: CATEGORIZATION ACCURACIES WITH DIFFERENT SEGMENTATION SCHEMES (D: DEFICIENTLY SEGMENTED; M: MODERATELY SEGMENTED; O: OVERLY SEGMENTED)

First, we apply the proposed method to the segmentation graph kernel [14], a category-level image classification model that measures the similarity between images by comparing their respective tree-structured segmented regions. As a pre-processing step, segmentation is critical to the categorization model. However, the watershed segmentation algorithm used in [14] fails to integrate high-level cues, which results in suboptimal segmentation and in turn negatively influences the categorization model. To refine the segmentation, we replace the unsupervised watershed algorithm with the proposed method. Particularly, to obtain image-level labels, following the experimental setting of Cheng et al. [6], we use the well-known Gist descriptor [22] to represent each image as a 960-dimensional feature vector and then apply k-means to cluster the images into 10 centers, where each center label is used as the image-level label.
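A sketch of this unsupervised labeling step, with scikit-learn's KMeans standing in for the clustering and gist_features assumed to be precomputed 960-dimensional Gist descriptors [22]:

```python
import numpy as np
from sklearn.cluster import KMeans

def pseudo_image_labels(gist_features, n_clusters=10):
    """Cluster 960-D Gist descriptors with k-means and use each image's
    cluster index as its coarse image-level label."""
    km = KMeans(n_clusters=n_clusters, n_init=10).fit(np.asarray(gist_features))
    return km.labels_
```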

Table III presents the categorization accuracy of [14] under different segmentation algorithms on the VOC 2009: watershed, normalized cut, and the proposed approach. Three segmentation settings (deficiently, moderately, and overly segmented) decompose each image into 5-10, 10-20, and 20-40 segmented regions, respectively. As shown in the table, the best categorization accuracy is achieved by the proposed approach. It is worth mentioning that, for a fair comparison, the k-means centers, i.e., coarse image-level labels computed in an unsupervised manner, are used, which still leads to significant performance improvement. In practice, the performance can be further boosted by replacing the k-means centers with more accurate image-level labels, for example, by training a multi-class SVM to predict them.

Second, we apply our approach to the photo cropping framework proposed by Cheng et al. [6]. Briefly, to represent photo aesthetics, Cheng et al. presented the omni-range context to model the spatial correlation of arbitrary pairwise segmented regions within an image. To evaluate the aesthetics of each candidate cropped photo, the omni-range context priors learned from the training images are integrated into a probabilistic measure, and the most aesthetically pleasing candidate is selected. As the most basic element for constructing the omni-range context, the quality of the segmented regions influences the cropping performance significantly. In Cheng et al.'s experiment, the images are segmented using the graph-based segmentation (GBS) of Felzenszwalb et al. [12], which fails to incorporate high-level cues and results in less satisfactory segmented regions. Similar to the image categorization experiment, we replace GBS with the proposed approach and report comparative cropping results. Particularly, three experimental settings are used to deficiently, moderately, and overly segment each image. A paired-comparison-based user study is carried out to quantitatively evaluate and compare the cropping performance: each subject is presented with a pair of cropped photos and asked to indicate a preference. In our experiment, the participants are 40 amateur/professional photographers from Zhejiang University.

Fig. 6. Average segmentation accuracy and running time with different maximum graphlet sizes.

C. Parameter Analysis

In the proposed algorithm, the maximum graphlet size T and the dimensionality d of post-embedding graphlets are two important parameters, and this subsection evaluates their influence on the segmentation performance. We use the SIFT-flow [19] data set for this experiment because, compared with the MSRC-21 [30] and the VOC 2009 [10], it contains more semantic categories, which makes it a better choice for illustrating the influence of different parameter settings across semantic categories.

Fig. 6 presents the average segmentation accuracy and time consumption with T varying from 1 to 10 on the SIFT-flow. Values of T larger than 10 are not considered, as they lead to too many graphlets and make the training stage computationally intractable. According to Fig. 6, two observations are made. First, the average segmentation accuracy increases consistently with T when T is no larger than 5, but stays unchanged as T grows from 5 to 10, suggesting that 5-sized graphlets suffice to capture the homogeneity of superpixels. Second, the average segmentation time increases approximately exponentially with the graphlet size, which argues for a small T. Based on these two observations, we set T to 5 on this data set.

Next, we present in Fig. 7 the average segmentation accuracy with the dimensionality of post-embedding graphlets ranging from 10 to 130 in steps of 10, again on the SIFT-flow [19]. The maximum dimensionality is set to 130 because the patch-alignment-based graphlet embedding fuses the color and texture channel descriptors, leading to a 137-dimensional vector. As shown in Fig. 7, the best segmentation results are achieved when d is between 40 and 60, and we therefore set d = 50 on the SIFT-flow data set.

Fig. 7. Average segmentation accuracy and running time with different values of the dimensionality of post-embedding graphlets.

Fig. 8. Example segmentation results with different maximum graphlet sizes (first two rows) and different values of the dimensionality of post-embedding graphlets (the remaining rows).

In addition to the quantitative analyses on the SIFT-flow, we present example segmentation results under different graphlet sizes and different dimensionalities of post-embedding graphlets in Fig. 8. As illustrated in the first two rows, segmentation results are unsatisfactory and unstable when the graphlet size is tuned from 1 to 4, while better results are observed when the graphlet size is between 5 and 10. As shown in the third to fifth rows, segmentation results are suboptimal and fluctuating when the dimensionality of post-embedding graphlets is between 10 and 30 or between 70 and 130, and more satisfactory and stable when it is between 40 and 60. This visual observation is consistent with the quantitative performance analyzed above.

VII. CONCLUSIONS

Weakly-supervised image segmentation is a useful technique that can be widely applied in computer vision [31], [32] and image processing [33], [34]. This paper introduced a causal BN framework for weakly-supervised image segmentation. First, graphlets are extracted to represent the high-order structural potentials among superpixels. Then, a patch-alignment-based embedding encodes 1) image-level labels, 2) image global spatial layout, 3) geometric context, and 4) multi-channel appearance features into graphlets. Thereafter, a hierarchical BN is constructed, and superpixels are labeled under an appearance-based and association-enforced constraint. Comparative experiments on three data sets demonstrate that the proposed approach outperforms four weakly-supervised methods and performs competitively with three fully-supervised ones.

REFERENCES

[1] S. Alpert, M. Galun, A. Brandt, and R. Basri, "Image segmentation by probabilistic bottom-up aggregation and cue integration," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 2, pp. 315–327, Feb. 2012.

[2] N. Guan, D. Tao, Z. Luo, and B. Yuan, "Manifold regularized discriminative nonnegative matrix factorization with fast gradient descent," IEEE Trans. Image Process., vol. 20, no. 7, pp. 2030–2048, Jul. 2011.

[3] Y. Gao, M. Wang, Z.-J. Zha, J. Shen, X. Li, and X. Wu, "Visual-textual joint relevance learning for tag-based social image search," IEEE Trans. Image Process., vol. 22, no. 1, pp. 363–376, Jan. 2013.

[4] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, "SLIC superpixels compared to state-of-the-art superpixel methods," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 11, pp. 2274–2282, Nov. 2012.

[5] S. Andrews, I. Tsochantaridis, and T. Hofmann, "Support vector machines for multiple-instance learning," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2003, pp. 561–568.

[6] B. Cheng, B. Ni, S. Yan, and Q. Tian, "Learning to photograph," in Proc. Int. Conf. Multimedia, Oct. 2010, pp. 291–300.

[7] J. Carreira and C. Sminchisescu, "CPMC: Automatic object segmentation using constrained parametric min-cuts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 7, pp. 1312–1328, Jul. 2012.

[8] C. Ding and X. He, "K-means clustering via principal component analysis," in Proc. ICML, Jul. 2004, pp. 225–232.

[9] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in Proc. CVPR, Jun. 2005, pp. 886–893.

[10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. (2009). The PASCAL Visual Object Classes Challenge 2009 (VOC2009) Results. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2009/workshop/index.html

[11] Y. Gao, M. Wang, D. Tao, R. Ji, and Q. Dai, "3-D object retrieval and recognition with hypergraph analysis," IEEE Trans. Image Process., vol. 21, no. 9, pp. 4290–4303, Sep. 2012.

[12] P. F. Felzenszwalb and D. P. Huttenlocher, "Efficient graph-based image segmentation," Int. J. Comput. Vis., vol. 59, no. 2, pp. 167–181, Sep. 2004.

[13] D. Hoiem, A. A. Efros, and M. Hebert, "Geometric context from a single image," in Proc. CVPR, 2009, pp. 1–8.

[14] Z. Harchaoui and F. Bach, "Image classification with segmentation graph kernels," in Proc. CVPR, Jun. 2007, pp. 1–8.

[15] S. Kim, S. Nowozin, P. Kohli, and C. D. Yoo, "Higher-order correlation clustering for image segmentation," in Advances in Neural Information Processing Systems. Cambridge, MA, USA: MIT Press, 2011, pp. 1530–1538.

[16] P. Kohli, L. Ladicky, and P. H. S. Torr, "Robust higher order potentials for enforcing label consistency," Int. J. Comput. Vis., vol. 82, no. 3, pp. 302–324, 2009.

[17] A. Kapoor, K. Grauman, R. Urtasun, and T. Darrell, "Active learning with Gaussian processes for object categorization," in Proc. ICCV, Oct. 2007, pp. 1–8.

[18] Y. Luo, D. Tao, B. Geng, C. Xu, and S. J. Maybank, "Manifold regularized multitask learning for semi-supervised multilabel image classification," IEEE Trans. Image Process., vol. 22, no. 2, pp. 523–536, Feb. 2013.

[19] C. Liu, J. Yuen, and A. Torralba, "Nonparametric scene parsing: Label transfer via dense scene alignment," in Proc. CVPR, Jun. 2009, pp. 1972–1979.

[20] L. Ladický, C. Russell, P. Kohli, and P. H. S. Torr, "Associative hierarchical CRFs for object class image segmentation," in Proc. ICCV, Sep./Oct. 2009, pp. 739–746.

[21] J. Lafferty, A. McCallum, and F. C. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in Proc. ICML, 2001, pp. 282–289.

[22] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int. J. Comput. Vis., vol. 42, no. 3, pp. 145–175, 2001.

[23] S. Rital, "Hypergraph cuts & unsupervised representation for image segmentation," Fundam. Inf., vol. 96, no. 1, pp. 153–179, 2009.

[24] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888–905, Aug. 2000.

[25] M. Song, D. Tao, C. Chen, J. Bu, J. Luo, and C. Zhang, "Probabilistic exposure fusion," IEEE Trans. Image Process., vol. 21, no. 2, pp. 341–357, Jan. 2012.

[26] D. Song and D. Tao, "Biologically inspired feature manifold for scene classification," IEEE Trans. Image Process., vol. 19, no. 1, pp. 174–184, Jan. 2010.

[27] L. Zhang, M. Song, Z. Liu, X. Liu, J. Bu, and C. Chen, "Probabilistic graphlet cut: Exploiting spatial structure cue for weakly supervised image segmentation," in Proc. CVPR, Jun. 2013, pp. 1908–1915.

[28] L. Zhang, M. Song, Q. Zhao, X. Liu, J. Bu, and C. Chen, "Probabilistic graphlet transfer for photo cropping," IEEE Trans. Image Process., vol. 22, no. 2, pp. 2887–2897, Feb. 2013.

[29] M. A. Stricker and M. Orengo, "Similarity of color images," in Proc. IS&T/SPIE Symp. Electron. Imag., Sci. Technol., Mar. 1995, pp. 381–392.

[30] J. Shotton, J. Winn, C. Rother, and A. Criminisi, "TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation," in Proc. ECCV, 2006, pp. 1–15.

[31] Y. Luo, D. Tao, B. Geng, C. Xu, and S. J. Maybank, "Manifold regularized multitask learning for semi-supervised multilabel image classification," IEEE Trans. Image Process., vol. 22, no. 2, pp. 523–536, Feb. 2013.

[32] M. Shi, R. Xu, D. Tao, and C. Xu, "W-tree indexing for fast visual word generation," IEEE Trans. Image Process., vol. 22, no. 3, pp. 1209–1222, Mar. 2013.

[33] W. Liu and D. Tao, "Multiview Hessian regularization for image annotation," IEEE Trans. Image Process., vol. 22, no. 7, pp. 2676–2687, Jul. 2013.

[34] J. Yu, M. Wang, and D. Tao, "Semisupervised multiview distance metric learning for cartoon synthesis," IEEE Trans. Image Process., vol. 21, no. 11, pp. 4636–4648, Nov. 2012.

[35] A. Vezhnevets and J. M. Buhmann, "Towards weakly supervised semantic segmentation by means of multiple instance and multitask learning," in Proc. CVPR, Jun. 2010, pp. 3249–3256.

[36] A. Vezhnevets, V. Ferrari, and J. M. Buhmann, "Weakly supervised semantic segmentation with a multi-image model," in Proc. ICCV, Nov. 2011, pp. 643–650.

[37] J. Verbeek and B. Triggs, "Region classification with Markov field aspect models," in Proc. CVPR, Jun. 2007, pp. 1–8.

[38] L. Zhang, Y. Gao, Y. Xia, K. Lu, J. Shen, and R. Ji, "Representative discovery of structure cues for weakly-supervised image segmentation," IEEE Trans. Multimedia, vol. 16, no. 2, pp. 470–479, Feb. 2014.

[39] A. Vezhnevets, J. M. Buhmann, and V. Ferrari, "Active learning for semantic segmentation with expected change," in Proc. CVPR, Jun. 2012, pp. 3162–3169.

[40] A. Vezhnevets, V. Ferrari, and J. M. Buhmann, "Weakly supervised structured output learning for semantic segmentation," in Proc. CVPR, Jun. 2012, pp. 845–852.

[41] X. Wang, Z. Li, and D. Tao, "Subspaces indexing model on Grassmann manifold for image search," IEEE Trans. Image Process., vol. 20, no. 9, pp. 2627–2635, Sep. 2011.

[42] S. Xiang, F. Nie, Y. Song, C. Zhang, and C. Zhang, "Embedding new data points for manifold learning via coordinate propagation," Knowl. Inf. Syst., vol. 19, no. 2, pp. 159–184, 2008.

Luming Zhang (M'14) is a Post-Doctoral Research Fellow with the School of Computing, National University of Singapore, Singapore. His research interests include multimedia analysis, image enhancement, and pattern recognition.

Yi Yang received the Ph.D. degree in computer science from Zhejiang University, Hangzhou, China, in 2010. He is currently a DECRA Fellow with the University of Queensland, Brisbane, QLD, Australia. Prior to that, he was a Post-Doctoral Research Fellow with the School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA. His research interests include machine learning and its applications to multimedia content analysis and computer vision, such as multimedia indexing and retrieval, image annotation, and video semantics understanding.

Yue Gao (SM'14) received the B.S. degree from the Harbin Institute of Technology, Harbin, China, and the M.E. and Ph.D. degrees from Tsinghua University, Beijing, China.

Yi Yu received the Ph.D. degree in computer science from Nara Women's University, Nara, Japan, in 2009. She was with different institutions, including the New Jersey Institute of Technology, Newark, NJ, USA, the University of Milan, Milan, Italy, and Nara Women's University. She is currently with the School of Computing, National University of Singapore, Singapore. Her research interests include social interactions over geo-aware multimedia streams, multimedia/music signal processing, audio classification and tagging, locality-sensitive-hashing-based music information retrieval, and pest sound classification. She was a recipient of the Best Paper Award at the 2012 IEEE International Symposium on Multimedia.

Changbo Wang received the Ph.D. degree from the State Key Laboratory of CAD&CG, Zhejiang University, Hangzhou, China, in 2006. He is currently a Professor with the Software Engineering Institute, East China Normal University, Shanghai, China. He was a Visiting Scholar with the State University of New York, Stony Brook, NY, USA, from 2009 to 2010. His research interests include computer graphics, information visualization, and virtual reality.

Xuelong Li (M'02–SM'07–F'12) is a Full Professor with the Center for OPTical IMagery Analysis and Learning (OPTIMAL), State Key Laboratory of Transient Optics and Photonics, Xi'an Institute of Optics and Precision Mechanics, Chinese Academy of Sciences, Xi'an 710119, Shaanxi, P. R. China.