

Harmonium Models for Semantic Video Representation and Classification

Jun Yang, Yan Liu∗, Eric P. Xing, Alexander G. Hauptmann
School of Computer Science, Carnegie Mellon University
Pittsburgh, PA 15213
juny, yanliu, epxing, [email protected]

Abstract
Accurate and efficient video classification demands the fusion of multimodal information and the use of intermediate representations. Combining these two ideas into one framework, we propose a probabilistic approach to video classification using intermediate semantic representations derived from multimodal features. Based on a class of bipartite undirected graphical models named harmonium, our approach represents video data as latent semantic topics derived by jointly modeling transcript keywords and color-histogram features, and performs classification using these latent topics within a unified framework. We show satisfactory classification performance of our approach on a benchmark data set, as well as interesting insights into the data.

1 Introduction
Classifying video data into semantic categories, sometimes known as semantic video concept detection, is an important research topic. Video data contain multiple data types, including image frames, transcript text, speech, and audio signals, each bearing correlated and complementary information essential to the analysis and retrieval of video data. The fusion of such multimodal information is regarded as a key research problem [10], and has been a widely used technique in video classification and retrieval methods. Many fusion strategies have been proposed, varying from early fusion [12], which merges the feature vectors extracted from different modalities, to late fusion, which combines the outputs of the classifiers or "retrieval experts" built on each single modality [12, 6, 18, 15]. Empirical results show that methods based on the fusion of multimodal information outperform those based on any single type of information in both classification and retrieval tasks.

∗Now affiliated with IBM T. J. Watson Research Center, Yorktown Heights, NY ([email protected])

Another trend in video classification is the search for low-dimensional, intermediate representations of video data. The primary motivation is to make sophisticated classifiers (e.g., SVM) affordable, which would otherwise be computationally expensive on the high-dimensional raw features. Moreover, using intermediate representations holds the promise of better interpretation of the data semantics, and may lead to superior classification performance. Related work along this direction includes conventional dimension-reduction methods such as principal component analysis (PCA) and Fisher linear discriminant (FLD) [4], as well as probabilistic methods such as probabilistic latent semantic indexing (pLSI) [5], latent Dirichlet allocation (LDA) [2], and the exponential-family harmonium (EFH) [14]. While many of these models were initially developed for single-modal data such as textual documents, some extensions have been studied recently in order to model multi-modal data such as captioned images and video [1, 17].

The key insights for video classification from previous work appear to be combining multimodal information and using intermediate representations. The goal of this paper is therefore to take advantage of both insights through an integrated and principled approach. Based on a class of bipartite, undirected graphical models (i.e., random fields) called harmonium [14, 17], our approach extracts an intermediate representation as latent semantic topics of video data by jointly modeling the correlated information in image regions and transcript keywords. Moreover, this approach explicitly introduces category label(s) into the model, which allows classification and representation to be accomplished in a unified framework.

The proposed approach differs significantly from previous models for text/multimedia data in that it incorporates category labels as (hidden) model variables, in addition to the variables representing data (features) and latent semantic topics. This allows us to classify unlabeled data by directly inferring the distribution of the label variables conditioned on the observed data variables. In contrast, existing models [2, 1, 5, 14, 17] focus solely on deriving intermediate representations in terms of latent semantic topics; one has to build a separate classifier on top of the derived intermediate representations if classification results are needed. Therefore, one major advantage of our approach is unifying both representation and classification in one model, which avoids the additional step of building separate classifiers. More importantly, by considering the interactions between latent semantic topics and category labels, our approach may be able to learn better intermediate representations that reflect the category information in the data. Such "supervised" intermediate representations are expected to provide more discriminative power and insights into the data than the "unsupervised" representations generated by existing methods [2, 1, 5, 14, 17].

Our proposal includes two related models, each with different implications for the representation and classification of video data. Family-of-harmonium (FoH) builds a family of category-specific harmonium models, each modeling the video data from one specific category. The label of a video shot is predicted by comparing its likelihood under each harmonium model. Hierarchical harmonium (HH) introduces the category labels as an additional layer of hidden variables in a single harmonium model, and performs classification through the inference of these label variables. The FoH model reveals the internal structure of each category, and can easily be extended to include new categories without retraining the whole model. In contrast, the HH model reveals the relationships between multiple categories, and takes advantage of such relationships in classification.

In Section 2 we review related work on the fusion of multimodal video features as well as representation models for video data. We describe the two proposed models in Section 3, and discuss their learning algorithms in Section 4. In Section 5, we show the experimental results and illustrate interesting interpretations of the data from the TRECVID video collection. Conclusions and future work are discussed in Section 6.

2 Related Work
As pointed out in [10], the processing, indexing, and fusion of data in multiple modalities is a core problem of multimedia research. For video classification and retrieval, the fusion of features from multiple data types (e.g., key-frames, audio, transcript) allows them to complement each other and achieve better performance than any single type of feature; this idea has been widely used in many existing methods. The fusion strategies vary from early fusion [12], which merges the feature vectors extracted from different data modalities, to late fusion, which combines the outputs of classifiers or "retrieval experts" built on each single modality [12, 6, 18, 15]. It remains an open question which fusion strategy is more appropriate for a given task, and a comparison of the two strategies in video classification is presented in [12]. The approach presented in this paper follows neither strategy; instead, it derives a latent semantic representation of the video data by jointly modeling the multimodal low-level features, so that the fusion takes place somewhere between early fusion and late fusion.

There are many approaches to obtaining low-dimensional intermediate representations of video data. Principal component analysis (PCA) has been the most popular method; it projects the raw features into a lower-dimensional feature space where the data variances are well preserved. Independent component analysis (ICA) and Fisher linear discriminant (FLD) are widely used alternatives for dimension reduction. Recently, there have also been many studies on modeling the latent semantic topics of text and multimedia data. For example, latent semantic indexing (LSI) by Deerwester et al. [3] transforms term counts linearly into a low-dimensional semantic eigenspace, and the idea was later extended by Hofmann to probabilistic LSI (pLSI) [5]. The latent Dirichlet allocation (LDA) of Blei et al. [2] is a directed graphical model that provides generative semantics for text documents, where each document is associated with a topic-mixing vector and each word is independently sampled according to a topic drawn from this topic-mixing. LDA has been extended to Gaussian-Mixture LDA (GM-LDA) and Correspondence LDA (Corr-LDA) [1], both of which are used to model annotated data such as captioned images or video with transcript text. The exponential-family harmonium (EFH) proposed by Welling et al. [14] is a bipartite undirected graphical model consisting of a layer of latent nodes representing semantic aspects and a layer of observed nodes representing the raw features. To model multi-modal data, Xing et al. [17] have extended it to the multi-wing harmonium model, where the data layer consists of two or more "wings" of nodes representing textual, imagery, and other types of data, respectively.

In practice, the methods mentioned above are mainly used for transforming the high-dimensional raw features into a low-dimensional representation that presumably captures the latent semantics of the data. The classification task is usually performed by building a separate discriminative classifier (e.g., SVM) on top of such latent semantic representations. In this paper, we seek a unified approach in which representation and classification are integrated into the same framework. This approach not only achieves satisfactory classification performance, but also provides interesting insights into the data semantics, such as the internal structure of each category and the relationships between different categories. Fei-Fei et al. [8] used a unified model for representing and classifying natural scene images by introducing category variables into the LDA model, which is similar to our approach except that our models are undirected.

Figure 1: A sketch of our approach (training and testing pipeline: the keywords and region-based color histogram of each video shot are modeled by the family of harmoniums, which predicts the category label of an unlabeled shot)

3 Our Approach
A sketch of our approach is illustrated in Figure 1. The data to be classified are called video shots, namely video segments whose length varies from a few seconds to half a minute or even longer. We represent each video shot as a bag of keywords (extracted from the video closed-captions or via speech recognition systems), and a set of fixed-sized image regions (extracted from the keyframe of the video shot). Each region is described by its color-histogram feature. In the training phase, our goal is to learn a model that best describes the joint distribution of the keywords and color features of the video shots in each category. In the testing phase, we extract the keywords and color features from an unlabeled video shot, and then use them as features to predict which category the shot belongs to. Our two proposed models, family-of-harmonium and hierarchical harmonium, differ in the way the data are modeled and classified.

Both of our models are based on a class of bipartite undirected models (i.e., random fields) called harmonium, which has been used by Welling et al. [14] and Xing et al. [17] to model text and multimedia data. Our models use their models as the basic building block, but differ from theirs by explicitly incorporating the category labels into the model. This allows our models to represent and classify video data in a unified framework, while the previous harmonium models serve only for data representation.

3.1 Notations and definitions  The notation used in this paper follows the conventions of probabilistic models. Uppercase characters represent random variables, while lowercase characters represent the instances (values) of the random variables. Bold font is used to indicate a vector of random variables or their values. In the illustrations, shaded circles represent observed nodes while unfilled circles represent hidden (latent) nodes. Each node in a graphical model is associated with a random variable, and we use the terms node and variable interchangeably in this paper.

The semantics of the model variables are described below:

• A video shot s is represented by a tuple (x, z, h, y), whose components respectively denote the keywords, region-based color features, latent semantic topics, and category labels of the shot.

• The vector x = (x_1, ..., x_N) denotes the keyword feature extracted from the transcript associated with the shot. Here N is the size of the word vocabulary, and x_i ∈ {0, 1} is a binary variable that indicates the absence or presence of the i-th keyword (of the vocabulary) in the shot.

• The vector z = (z_1, ..., z_M) denotes the color-histogram features of the keyframe in the shot. Each keyframe is evenly divided into a grid of M fixed-sized rectangular regions, and z_j ∈ R^C is a C-dimensional vector that represents the color histogram of the j-th region. So z is a stacked vector of length CM.

• The vector h = (h_1, ..., h_K) represents the latent semantic topics of the shot, where K is the total number of latent topics. Each component h_k ∈ R denotes how strongly the shot is associated with the k-th latent topic.

• The category labels of a shot are modeled differently in the two models. In family-of-harmonium, a single variable y ∈ {1, ..., T} indicates the category the shot belongs to, where T is the total number of categories. In hierarchical harmonium, the labels are represented by a vector y = (y_1, ..., y_T), with each y_t ∈ {0, 1} denoting whether the shot is in the t-th category. Here a video shot belongs to only one category, so we have ∑_t y_t = 1.

• The two proposed models have different sets of parameters. The family-of-harmonium has a specific harmonium model for each category y, with parameters θ^y = (π^y, α^y, β^y, W^y, U^y). The hierarchical harmonium has a single set of parameters θ = (α, β, τ, W, U, V).

Figure 2: The family-of-harmonium model (T category-specific harmoniums, each with its own parameters (α^t, β^t, W^t, U^t))

3.2 Family-of-harmonium (FoH)  The FoH model is illustrated in Figure 2. It contains a set of T category-specific harmoniums, each modeling the video data from a specific category. Each harmonium is a bipartite undirected graphical model that consists of two layers of nodes. Nodes in the top layer represent the latent semantic topics H = {H_k} of the data. To represent the bi-modal features of video data, the bottom layer contains two "wings" of observed nodes that represent the keyword features X = {X_i} and the region-based color features Z = {Z_j}, respectively. Each node is linked with all the nodes in the opposite layer, but not with any node in the same layer. This topology ensures that the nodes in one layer are conditionally independent given the nodes in the opposite layer, a property important to the construction and inference of the model. All the component harmoniums in FoH share exactly the same structure, but each has a unique set of parameters θ^y = (π^y, α^y, β^y, W^y, U^y) indexed by the category label y.

We now describe the distributions of these variables. The category label Y follows a prior multinomial distribution:

(3.1)  p(y) = Multi(π_1, ..., π_T),

where ∑_{t=1}^T π_t = 1. In FoH, Y is not actually linked with any nodes in the component harmoniums; instead, it serves as an indicator variable for selecting a specific harmonium to model the video data of that particular category. In the distribution function of each harmonium, Y appears only as a subscript of the model parameters.

Given its category label y, we consider the raw features of a shot as well as its latent semantic topics as two layers of representations mutually influencing each other in the specific harmonium associated with this category. We can either conceive of the keyword and color features as being generated by the latent semantic topics, or conceive of the semantic topics as being summarized from the keyword and image features. This mutual influence is reflected in the conditional distributions of the variables representing the features and the semantic topics.

For the keyword feature, the variable x_i indicating the presence/absence of term i ∈ {1, ..., N} of the vocabulary follows the distribution:

(3.2)  P(X_i = 1 | h, y) = 1 / (1 + exp(−α^y_i − ∑_k W^y_{ik} h_k))
       P(X_i = 0 | h, y) = 1 − P(X_i = 1 | h, y)

This shows that each keyword in a video shot is sampled from a Bernoulli distribution dependent on the latent semantic topics h; that is, the probability that a keyword appears is affected by a weighted combination of the semantic topics h. The parameters α^y_i and W^y_{ik} are both scalars, so α^y = (α^y_1, ..., α^y_N) is an N-dimensional vector, and W^y = [W^y_{ik}] is a matrix of size N × K. Due to the conditional independence of the x_i given h, we have p(x|h, y) = ∏_i p(x_i|h, y).


The color-histogram feature z_j of the j-th region in the keyframe of the shot admits a conditional multivariate Gaussian distribution:

(3.3)  p(z_j | h, y) = N(z_j | Σ^y_j (β^y_j + ∑_k U^y_{jk} h_k), Σ^y_j)

where z_j is sampled from a distribution parameterized by the latent semantic topics h. Here, both β^y_j and U^y_{jk} are C-dimensional vectors, and therefore β^y = (β^y_1, ..., β^y_M) is a stacked vector of dimension CM and U^y = [U^y_{jk}] is a matrix of size CM × K. Note that Σ^y_j is a C × C covariance matrix, which, for simplicity, is set to the identity matrix I in our model. Again, we have p(z|h, y) = ∏_j p(z_j|h, y) due to conditional independence.

Finally, each latent topic variable h_k follows a conditional univariate Gaussian distribution whose mean is determined by a weighted combination of the keyword feature x and the color feature z:

(3.4)  p(h_k | x, z, y) = N(h_k | ∑_i W^y_{ik} x_i + ∑_j U^y_{jk} z_j, 1)

where W^y_{ik} and U^y_{jk} are the same parameters used in Eq.(3.2) and (3.3). Similarly, p(h|x, z, y) = ∏_k p(h_k|x, z, y) holds.
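To make these conditionals concrete, the sketch below (a minimal illustration under the assumptions of Section 3.1, with Σ^y_j fixed to the identity and z kept in its stacked CM-dimensional form; the function and array names are ours, not the authors') computes the parameters of the three conditional distributions for one category-specific harmonium:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical shapes for one category-specific harmonium:
#   alpha: (N,),  W: (N, K)     -- keyword wing, Eq.(3.2)
#   beta: (M*C,), U: (M*C, K)   -- color wing, Eq.(3.3) with Sigma = I
def p_x_given_h(alpha, W, h):
    """Bernoulli success probabilities P(X_i = 1 | h, y) of Eq.(3.2)."""
    return sigmoid(alpha + W @ h)

def z_mean_given_h(beta, U, h):
    """Mean of the Gaussian p(z | h, y) of Eq.(3.3) (covariance is I)."""
    return beta + U @ h

def h_mean_given_xz(W, U, x, z):
    """Mean of the unit-variance Gaussian p(h | x, z, y) of Eq.(3.4)."""
    return W.T @ x + U.T @ z
```

Sampling a shot or its topics then amounts to drawing Bernoulli and unit-variance Gaussian variates around these quantities.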

So far we have presented the conditional distributions of all the variables in the model. These local conditionals can be mapped to the following harmonium random field:

(3.5)  p(x, z, h | y) ∝ exp{ ∑_i α^y_i x_i + ∑_j β^y_j z_j − ∑_j z_j²/2 − ∑_k h_k²/2 + ∑_{ik} W^y_{ik} x_i h_k + ∑_{jk} U^y_{jk} z_j h_k }

We present the detailed derivation of this random field in the Appendix. Note that the partition function (global normalization term) of this distribution is not explicitly shown, so we use a proportionality sign instead of an equality sign. This hidden partition function increases the difficulty of learning the model.

By integrating out the hidden variables h in Eq.(3.5), we obtain the category-conditional distribution over the observed keyword and color features of a video shot:

(3.6)  p(x, z | y) ∝ exp{ ∑_i α^y_i x_i + ∑_j β^y_j z_j − ∑_j z_j²/2 + ½ ∑_k (∑_i W^y_{ik} x_i + ∑_j U^y_{jk} z_j)² }

There is also a hidden partition function in this distribution. The marginal distribution (likelihood) of a labeled video shot can be decomposed into a category-specific marginal and a prior over the categories, i.e., p(x, z, y) = p(x, z|y) p(y).

Figure 3: Hierarchical harmonium model

The learning of FoH involves learning T component harmoniums, each learned independently using the (labeled) video shots from the corresponding category. To learn the harmonium model for a category y, we estimate its model parameters θ^y = (α^y, β^y, W^y, U^y) by maximizing the likelihood of the video shots in category y, where the likelihood function is defined by Eq.(3.6). Due to the existence of the partition function, the learning requires approximate inference methods, which we discuss further in Section 4.

The category of an unlabeled shot is predicted by finding the component harmonium that best describes the features of the shot. Given the keyword feature x and color feature z of a shot, we compute the posterior probability of each category label as:

(3.7)  p(y|x, z) ∝ p(x, z|y) p(y) ∝ p(x, z|y)

The second step in the derivation assumes that the category prior is a uniform distribution, i.e., p(y) = 1/T. Eq.(3.7) indicates that we can predict the category of a shot by comparing its likelihood p(x, z|y) under each of the category-specific harmoniums, computed by Eq.(3.6). The harmonium that best fits the shot determines its category (here we adopt a similar idea to generative classifiers, such as naive Bayes, except that we assume equal priors for all categories).
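As an illustration of this decision rule, the sketch below evaluates the exponent of Eq.(3.6) for each category-specific harmonium and picks the best fit under the uniform-prior assumption of Eq.(3.7). It deliberately omits the per-category log-partition term, which in practice must be estimated with the approximate inference methods of Section 4; all names are hypothetical:

```python
import numpy as np

def unnorm_loglik(theta, x, z):
    """Exponent of Eq.(3.6) for one category-specific harmonium;
    theta = (alpha, beta, W, U). The log-partition term is omitted."""
    alpha, beta, W, U = theta
    s = W.T @ x + U.T @ z                       # combined topic activations
    return alpha @ x + beta @ z - 0.5 * (z @ z) + 0.5 * (s @ s)

def predict_category(thetas, x, z):
    """Eq.(3.7) with uniform prior p(y) = 1/T: return the index of the
    harmonium under which the shot is most likely."""
    return int(np.argmax([unnorm_loglik(t, x, z) for t in thetas]))
```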

3.3 Hierarchical harmonium (HH)  The second proposed model, hierarchical harmonium, adopts a different way of incorporating category labels into the basic harmonium model. Instead of building a separate harmonium for each category, it introduces the category labels as another layer of nodes Y = {Y_1, ..., Y_T} in one single harmonium, with Y_t ∈ {0, 1} indicating a shot's membership in category t. As illustrated in Figure 3, these label variables Y form a bipartite subgraph with the latent topic nodes H. There is a link between every Y_t and H_k, but no link between two Y_t, which are conditionally independent given H. Unlike FoH, there is only a single hierarchical harmonium in this model.

In the HH model, the conditional distributions of x and z stay the same as those in the FoH model, defined by Eq.(3.2) and Eq.(3.3), respectively. The only difference is that the model parameters θ = (α, β, τ, W, U, V) no longer depend on the category labels. The label variable Y_t follows a Bernoulli distribution:

(3.8)  P(Y_t = 1|h) = 1 / (1 + exp(−τ_t − ∑_k V_{tk} h_k))
       P(Y_t = 0|h) = 1 − P(Y_t = 1|h)

where V = [V_{tk}] is a matrix of size T × K. Note that if we treat h as input and V_{tk} and τ_t as parameters, this distribution has exactly the same form as the distribution of the class label in logistic regression [4], i.e., P(Y = 1|x) = 1/(1 + exp(−β_0 − β^T x)). This implies that the model is effectively performing logistic regression to compute each category label Y_t using the latent semantic topics h as input.

The distribution of each latent topic variable h_k needs to be modified to incorporate the interactions between the label variables y and the topic variables h:

(3.9)  p(h_k | x, z, y) = N(h_k | ∑_i W_{ik} x_i + ∑_j U_{jk} z_j + ∑_t V_{tk} y_t, 1)

Therefore, the distribution of the latent semantic topics is affected not only by the data features x and z, but also by their labels y. This is significantly different from existing harmonium models [14, 17], in which the distribution of the latent topics depends on the observed features only.

With the incorporation of the label variables, the random field of the hierarchical harmonium becomes:

(3.10)  p(x, z, h, y) ∝ exp{ ∑_i α_i x_i + ∑_j β_j z_j − ∑_j z_j²/2 + ∑_t τ_t y_t − ∑_k h_k²/2 + ∑_{ik} W_{ik} x_i h_k + ∑_{jk} U_{jk} z_j h_k + ∑_{tk} V_{tk} y_t h_k }

After integrating out the hidden variables h, the marginal distribution of a labeled video shot (x, z, y) is:

(3.11)  p(x, z, y) ∝ exp{ ∑_i α_i x_i + ∑_j β_j z_j − ∑_j z_j²/2 + ∑_t τ_t y_t + ½ ∑_k (∑_i W_{ik} x_i + ∑_j U_{jk} z_j + ∑_t V_{tk} y_t)² }

The parameters of the HH model, θ = (α, β, τ, W, U, V), are estimated by maximizing the likelihood function defined by Eq.(3.11). Classification is performed in a very different way in HH. To predict the category of an unlabeled video shot, we infer the unknown label variables Y of the shot from its keyword and color features. This is done by computing the conditional probability p(Y_t = 1|x, z) for each label variable Y_t. The category that gives the highest conditional probability is predicted as the category of the shot:

(3.12)  t* = argmax_t p(Y_t = 1|x, z)

There is, however, no analytical solution for this conditional probability. Various approximate inference methods are available to solve this problem, as further discussed in Section 4.

3.4 Model comparison  We compare our models with other existing models for text and multimedia data analysis, including pLSI [5], LDA [2] and its variants GM-LDA and Corr-LDA [1], and the exponential-family harmonium [14, 17]. First of all, our models not only derive the latent semantic representation of the data but also perform classification within the same framework. In contrast, all the models above are intended only for data representation, and therefore separate classifiers need to be trained for the classification task. This is not necessarily a theoretical advantage of our approach, but it does provide a more integrated and cleaner setting, which presumably leads to superior performance and better data interpretation. The Bayesian hierarchical model, an extension of the LDA model with similar ideas, has demonstrated strong empirical improvement for scene classification [8]. Second, in our models the category labels "supervise" the derivation of the latent semantic representation. As a result, the derived representation reflects not only the characteristics of the underlying data but also the category information, which is different from the "unsupervised" derivation in all the other models. The third issue is the choice between directed and undirected models. The harmonium models [14, 17], including the ones proposed in this paper, are all undirected models, while the rest are directed ones. There is no general conclusion as to which is better: in undirected models, inference is much easier due to the conditional independence of the hidden variables, but learning is usually harder due to the global normalization term.

There are also several interesting observations when we compare the two proposed models. First, they differ in the semantics of the learnt latent topics. In FoH, each harmonium model is built for a specific category, and therefore the latent topics in each harmonium capture the internal structure of the data in that category, i.e., they represent the themes or data sub-clusters of that particular category. There is no correspondence between the semantic topics across different harmoniums: the first topic in one harmonium is unrelated to the first topic in another. In contrast, HH has a single set of latent semantic topics derived from the data of all categories. These semantic topics are however different from those learned by other representation models, as they are "supervised" by the category labels and presumably contain more discriminative information. Sharing a single semantic representation also helps to reveal the connections and differences between multiple categories. The two models also differ in terms of scalability. FoH can easily accommodate a new category by adding another harmonium trained from the data of the new category, without any changes to the existing harmoniums. However, introducing a new category into HH means adding another (label) node to the model, which requires retraining the whole model since its structure has changed.

4 Learning and inference
The parameters of our models, namely (α^y, β^y, W^y, U^y) in the FoH model and (α, β, τ, W, U, V) in the HH model, can be estimated by maximizing the data likelihood. However, there is no closed-form solution for the parameters in complex models like ours, and therefore iterative search algorithms have to be applied. As an example, we discuss the learning and inference algorithms for the HH model; the learning and inference of each component harmonium in the FoH model can be derived accordingly.

As described in the previous section, the log-likelihood of the data under the HH model is defined by Eq.(3.11). By taking derivatives of the log-likelihood function w.r.t. the parameters, we obtain the following gradient learning rules:

(4.13)  δα_i = 〈x_i〉_p̃ − 〈x_i〉_p
        δβ_j = 〈z_j〉_p̃ − 〈z_j〉_p
        δτ_t = 〈y_t〉_p̃ − 〈y_t〉_p
        δW_{ik} = 〈x_i h′_k〉_p̃ − 〈x_i h′_k〉_p
        δU_{jk} = 〈z_j h′_k〉_p̃ − 〈z_j h′_k〉_p
        δV_{tk} = 〈y_t h′_k〉_p̃ − 〈y_t h′_k〉_p

where h′_k = ∑_i W_{ik} x_i + ∑_j U_{jk} z_j + ∑_t V_{tk} y_t, and 〈·〉_p̃ and 〈·〉_p denote expectations under the empirical distribution (i.e., the data average) and the model distribution of the harmonium, respectively. Like other undirected graphical models, the harmonium has a global normalization term in its likelihood function, which makes the direct computation of 〈·〉_p intractable. Therefore, we need approximate inference methods to estimate these model expectations 〈·〉_p. We explored four methods, briefly discussed below. The conditional distribution of the label nodes, p(Y_t = 1|x, z), which is needed for predicting class labels, is also computed using these approximate inference methods.
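In code, one ascent step following Eq.(4.13) might look like the sketch below, where `data_avg` holds the empirical expectations 〈·〉_p̃ and `model_avg` holds the model expectations 〈·〉_p supplied by any of the four approximate inference methods discussed next (the dictionary layout and the plain learning rate are our assumptions):

```python
import numpy as np

def grad_step(params, data_avg, model_avg, lr=0.01):
    """One gradient-ascent step on the HH log-likelihood, Eq.(4.13).
    Both *_avg dicts hold expected sufficient statistics:
    'x': (N,), 'z': (MC,), 'y': (T,), 'xh': (N,K), 'zh': (MC,K), 'yh': (T,K)."""
    pairs = [('alpha', 'x'), ('beta', 'z'), ('tau', 'y'),
             ('W', 'xh'), ('U', 'zh'), ('V', 'yh')]
    for name, stat in pairs:
        params[name] = params[name] + lr * (data_avg[stat] - model_avg[stat])
    return params
```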

4.1 Mean field approximation  Mean field (MF) is a variational method that approximates the model distribution p through a factorized form as a product of marginals over clusters of variables [16]. We use the naive version of MF, where the joint probability p is approximated by a surrogate distribution q given as a product of singleton marginals over the variables:

q(x, z, y, h) = ∏_i q(x_i|ν_i) ∏_j q(z_j|μ_j, I) ∏_t q(y_t|λ_t) ∏_k q(h_k|γ_k)

where the singleton marginals are defined as q(x_i) ~ Bernoulli(ν_i), q(z_j) ~ N(μ_j, I), q(y_t) ~ Bernoulli(λ_t), and q(h_k) ~ N(γ_k, 1), and {ν_i, μ_j, λ_t, γ_k} are variational parameters. The variational parameters can be computed by minimizing the KL-divergence between p and q, which results in the following fixed-point updating equations:

ν_i = σ(α_i + ∑_k W_{ik} γ_k)
μ_j = β_j + ∑_k U_{jk} γ_k
λ_t = σ(τ_t + ∑_k V_{tk} γ_k)
γ_k = ∑_i W_{ik} ν_i + ∑_j U_{jk} μ_j + ∑_t V_{tk} λ_t

where σ(x) = 1/(1 + exp(−x)) is the sigmoid function. After the fixed-point equations converge, the surrogate distribution q is fully specified by the converged variational parameters. We replace the intractable 〈·〉_p with 〈·〉_q in Eq.(4.13), which is easy to compute from the fully factorized q. Note that after each iterative search step in Eq.(4.13), we need to recompute the variational parameters of q, since the model parameters of p have been updated.
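These fixed-point updates translate directly into code. The sketch below (names are ours; the iteration count is arbitrary) clamps the observed features x and z, so only the λ_t and γ_k updates are iterated; the converged λ_t then approximate p(Y_t = 1|x, z), which is exactly what Eq.(3.12) needs for classification:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mean_field_labels(params, x, z, n_iters=50):
    """Naive mean-field updates of Section 4.1 with x, z clamped;
    returns the label marginals lambda_t and topic means gamma_k."""
    tau, W, U, V = params['tau'], params['W'], params['U'], params['V']
    gamma = W.T @ x + U.T @ z          # initialize topic means from the data
    lam = np.full(V.shape[0], 0.5)     # uninformative label marginals
    for _ in range(n_iters):
        lam = sigmoid(tau + V @ gamma)            # lambda_t update
        gamma = W.T @ x + U.T @ z + V.T @ lam     # gamma_k update
    return lam, gamma

# Classification per Eq.(3.12):
#   lam, _ = mean_field_labels(params, x, z); t_star = int(np.argmax(lam))
```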

4.2 Gibbs sampling  Gibbs sampling, a special form of the Markov chain Monte Carlo (MCMC) method, has been widely used for approximate inference in complex graphical models [7]. This method repeatedly samples variables in a particular order, one variable at a time, conditioned on the current values of the other variables. For example, in our hierarchical harmonium model we define the sampling order as y_1, ..., y_T, h_1, ..., h_K, sample each y_t from the conditional distribution defined in Eq.(3.8) using the current values of h_k, and finally sample each h_k according to Eq.(3.9). After a large number of iterations (the "burn-in" period), this procedure is guaranteed to reach an equilibrium distribution that in theory equals the model distribution p. Therefore, we use the empirical expectation computed from the Gibbs samples collected after the burn-in period to approximate the true expectation 〈·〉_p.
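A sketch of such a sampler for the HH model is given below (names are ours; x and z are clamped to the observed features for illustration, and the y and h layers are each sampled as a block, which is valid because of the conditional independence within a layer):

```python
import numpy as np

def gibbs_expectations(params, x, z, n_burnin=1000, n_samples=500, seed=0):
    """Gibbs sampling over (y, h): y_t via Eq.(3.8), h_k via Eq.(3.9);
    returns Monte Carlo estimates of E[y] and E[h] after burn-in."""
    rng = np.random.default_rng(seed)
    tau, W, U, V = params['tau'], params['W'], params['U'], params['V']
    T, K = V.shape
    h = rng.standard_normal(K)
    y_sum, h_sum = np.zeros(T), np.zeros(K)
    for it in range(n_burnin + n_samples):
        p_y = 1.0 / (1.0 + np.exp(-(tau + V @ h)))    # Eq.(3.8)
        y = (rng.random(T) < p_y).astype(float)
        mean_h = W.T @ x + U.T @ z + V.T @ y          # Eq.(3.9)
        h = mean_h + rng.standard_normal(K)           # unit-variance draw
        if it >= n_burnin:
            y_sum += y
            h_sum += h
    return y_sum / n_samples, h_sum / n_samples
```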

4.3 Contrastive divergence  An alternative to exact gradient ascent based on the learning rules in Eq.(4.13) is the contrastive divergence (CD) algorithm of Welling and Hinton [13], which approximates the gradient learning rules. In each gradient update step, instead of computing the model expectation 〈·〉_p, CD starts from the empirical values as the initial samples, runs Gibbs sampling for only a few iterations, and uses the resulting distribution q to approximate the model distribution p. It has been proved that the final parameter values obtained by this kind of updating converge to the maximum likelihood estimate [13]. In our implementation, we compute 〈·〉_q from a large number of samples obtained by running only one step of Gibbs sampling with different initializations. CD is thus significantly more efficient than the Gibbs sampling method, since the "burn-in" process is skipped.
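A CD-style estimate of the model expectations in Eq.(4.13) can be sketched as below, where `gibbs_sweep` is a hypothetical helper performing one block-Gibbs reconstruction step as in Section 4.2, started from a training case rather than from a long chain:

```python
import numpy as np

def cd1_model_avg(params, shots, gibbs_sweep):
    """One-step contrastive divergence: average the sufficient statistics
    over reconstructions (x', z', y', h') obtained by a single Gibbs sweep
    started at each data case (x, z, y)."""
    stats = {k: 0.0 for k in ('x', 'z', 'y', 'xh', 'zh', 'yh')}
    n = len(shots)
    for x, z, y in shots:
        x1, z1, y1, h1 = gibbs_sweep(params, x, z, y)  # one reconstruction
        stats['x'] += x1 / n
        stats['z'] += z1 / n
        stats['y'] += y1 / n
        stats['xh'] += np.outer(x1, h1) / n
        stats['zh'] += np.outer(z1, h1) / n
        stats['yh'] += np.outer(y1, h1) / n
    return stats
```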

4.4 The uncorrected Langevin method  The uncorrected Langevin method [9] originates from the Langevin Monte Carlo method by accepting all proposal moves. It makes use of the gradient information and resembles noisy steepest ascent, which helps avoid local optima. Similar to gradient ascent, the uncorrected Langevin algorithm has the following update rule:

(4.14)  λ_ij^new = λ_ij + (ε²/2) ∂/∂λ_ij log p(X, λ) + ε n_ij

where n_ij ~ N(0, 1) and ε is a parameter that controls the step size. Like the contrastive divergence algorithm, we use only a few iterations of Gibbs sampling to approximate the model distribution p.
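The update of Eq.(4.14) is a one-line modification of gradient ascent; a sketch (hypothetical names, with the log-likelihood gradient supplied by the short Gibbs runs mentioned above):

```python
import numpy as np

def langevin_step(lam, grad, eps=0.01, rng=None):
    """Uncorrected Langevin update, Eq.(4.14): a gradient-ascent step of
    size eps^2/2 plus N(0,1) noise n_ij scaled by eps."""
    rng = rng or np.random.default_rng(0)
    return lam + 0.5 * eps ** 2 * grad + eps * rng.standard_normal(lam.shape)
```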

5 Experiments
We evaluate the proposed models using video data from the TRECVID 2003 development set [11]. Based on the manual annotations of this set, we choose 2468 shots that belong to 15 semantic categories: airplane, animal, baseball, basketball, beach, desert, fire, football, hockey, mountain, office, road traffic, skating, studio, and weather news. Each shot belongs to only one category. The size of a category varies from 46 to 373 shots. The keywords of each shot are extracted from the video closed-captions associated with that shot. By removing non-informative words such as stop words and less frequent words, we reduce the total number of distinct keywords (vocabulary size) to 3000. Meanwhile, we evenly divide the key-frame of each shot into a grid of 5×5 regions, and extract a 15-dimensional color histogram in HVC color space from each region. Therefore, each video shot is represented by a 3000-d keyword feature and a 375-d color-histogram feature. For simplicity, the keyword features are made binary, capturing only the presence/absence of each keyword, because a keyword rarely appears multiple times within the short duration of a shot.

Figure 4: The representative images and keywords of 5 latent topics derived from the data in category "Fire"
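For concreteness, the sketch below shows one way such features could be computed (our simplification: a generic per-cell histogram over pre-quantized color indices stands in for the paper's 15-dimensional HVC histogram, and tokenization is reduced to whitespace splitting):

```python
import numpy as np

def keyword_feature(transcript, vocab):
    """Binary keyword vector: presence/absence of each vocabulary word."""
    words = set(transcript.lower().split())
    return np.array([1.0 if w in words else 0.0 for w in vocab])

def color_feature(keyframe, grid=5, bins=15):
    """Stacked color feature: one bins-dim histogram per cell of a
    grid x grid partition of the keyframe (here 5*5*15 = 375 dims).
    keyframe: 2-D array of quantized color indices in [0, bins)."""
    H, W = keyframe.shape
    feats = []
    for r in range(grid):
        for c in range(grid):
            cell = keyframe[r * H // grid:(r + 1) * H // grid,
                            c * W // grid:(c + 1) * W // grid]
            hist = np.bincount(cell.ravel(), minlength=bins).astype(float)
            feats.append(hist / max(hist.sum(), 1.0))   # normalized histogram
    return np.concatenate(feats)
```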

The experimental results are presented in two parts. First, we show some illustrative examples of the latent semantic topics derived by the proposed models and discuss the insights they provide into the structure and relationships of video categories. In the second part, we evaluate the performance of our models on video classification in comparison with some existing approaches.


Figure 5: The representative images and keywords of 5 latent topics derived from the whole data set

5.1 Interpretation of latent semantic topics
Both the family-of-harmonium (FoH) and the hierarchical harmonium (HH) model derive latent semantic topics as intermediate representations of video data. Since each harmonium in FoH is learned independently from the data of a specific category, its latent topics are expected to capture the structure of that particular category. To show that these topics are meaningful, in Figure 4 we illustrate 5 latent topics learned from the video category "Fire" by showing the keywords and images associated with the 5 video shots that have the highest conditional probability given each latent topic. As we can see, the 5 topics roughly correspond to 5 sub-categories under the category "fire", which can be described as "forest fire in the night", "explosion in outer space", "launch of missile or space shuttle", "smoke of fire", and "close-up scene of fire". Since these latent topics are derived by jointly modeling the textual and image features of the video data, they are more than simply clusters in the color or keyword feature space; rather, they are "co-clusters" in both feature spaces. For example, the shots of Topic 1 are very similar to each other visually; the shots of Topic 2 are not so similar visually, but it is clear that they have very close semantic meanings and share common keywords such as "flight" and "radar". The keywords associated with Topic 5 seem irrelevant at first glance, but we later found that these shots contain scenes from a movie, which explains the occurrence of keywords like "love", "freedom", and "beautiful".

We also illustrate 5 latent topics out of a set of 20 topics learned in the HH model in Figure 5. Note that these topics are learned from the whole data set instead of the data from one category, so they are expected to represent some high-level semantic topics. We can see that these 5 topics are about "studio", "baseball or football", "weather news", "airplane or skating", and "animal", which can be roughly mapped to some of the 15 categories in the data set. These results clearly show that the latent semantic topics learned by our models are able to capture the semantics of the video data.

Figure 6: The color-coded matrix showing the pairwise similarity between the 15 categories (1. fire, 2. football, 3. mountain, 4. studio, 5. airplane, 6. animal, 7. baseball, 8. basketball, 9. beach, 10. desert, 11. hockey, 12. skating, 13. office, 14. traffic, 15. weather). Best viewed in color.

Another advantage of the hierarchical harmonium, as discussed in Section 3.4, is that it reveals the relationships between different categories through the hidden topics. We can tell how strongly a category t is associated with a latent topic k from the conditional probability p(y_t|h_k). Therefore, we can compute the similarity between any two categories by examining the hidden topics they are associated with. We show the pairwise similarity between the 15 categories using the color-coded matrix in Figure 6, where redder colors denote higher similarity and bluer colors denote lower similarity. We can see many meaningful pairs of related categories: e.g., "mountain" is strongly related to "animal", and "baseball" is related to "hockey", while "studio" is not related to any category. These relationships are basically consistent with common sense.
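The paper does not spell out the exact similarity measure behind Figure 6; one plausible instantiation (our assumption, not the authors' stated formula) treats row t of V, the association strengths between category t and the K topics in Eq.(3.8), as that category's profile and compares profiles by cosine similarity:

```python
import numpy as np

def category_similarity(V):
    """Pairwise category similarity from topic-association profiles:
    cosine similarity between rows of V (T x K), one way to produce
    the color-coded T x T matrix of Figure 6."""
    P = V / np.maximum(np.linalg.norm(V, axis=1, keepdims=True), 1e-12)
    return P @ P.T
```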

5.2 Performance on video classification  To evaluate the performance of the FoH and HH models in video classification, we evenly divide our data set into a training set and a test set. The model parameters are estimated from the training set. Specifically, we implemented the learning methods based on the four inference algorithms described in Section 4, in order to examine their efficiency and accuracy. We also explore the issue of model selection, namely the impact of the number of latent semantic topics on the classification performance.

Figure 7: Classification performance of different models (classification accuracy vs. number of hidden topic nodes, for GM-Mixture, GM-LDA, Corr-LDA, DWH, FoH, and HH)

Several other methods have been implemented for comparison, all of which produce intermediate representations of some kind for the video data. First, we implemented the approach used in [17], which learns a dual-wing harmonium (DWH) from the data and then builds an SVM classifier based on the latent semantic representations generated by DWH. We also implemented three directed graphical models for representing video data: the Gaussian-multinomial mixture model (GM-Mixture), Gaussian-multinomial latent Dirichlet allocation (GM-LDA), and correspondence latent Dirichlet allocation (Corr-LDA). The details of these models can be found in [1]. Similar to DWH, all three directed models are used only for data representation, and each of them requires an SVM classifier for classification. To make the experiments tractable across the various models with different learning algorithms and different numbers of latent topics, we restrict this part of the experiments to a subset of our collection consisting of the 5 largest categories (airplane, basketball, baseball, hockey, and weather news), containing 1078 shots in total.

Figure 7 shows the classification accuracies of the proposed FoH and HH models as well as the comparison methods, including DWH, GM-Mixture, GM-LDA, and Corr-LDA. For fairness, all the models are implemented using the mean-field variational method (MF) for learning and inference, except GM-Mixture, which uses the expectation-maximization (EM) method. All the approaches are evaluated with the number of latent semantic topics set to 5, 10, 20, and 50, in order to study the relationship between performance and model complexity.

Figure 8: Classification performance of different approximate inference methods (CD, MF, Langevin, and Gibbs) in the hierarchical harmonium

Several interesting observations can be drawn from Figure 7. First, the three undirected models (FoH, HH, and DWH) achieve significantly higher performance than the directed models (GM-Mixture, GM-LDA, and Corr-LDA), which indicates that the harmonium model is an effective tool for video representation and classification. Among them, FoH is the best performer at 5 and 10 latent semantic nodes, while DWH is the best performer at 20 and 50 latent nodes with HH a close runner-up. Second, we find that the performance of FoH and HH is overall comparable with DWH. Given that DWH uses an SVM classifier, this result is encouraging, as it shows that our approach is comparable in performance to a state-of-the-art discriminative classifier. On the other hand, our approach enjoys many advantages that SVM does not have; for example, FoH can easily be extended to accommodate a new category without retraining the whole model. Third, the performance of DWH and HH improves as the number of latent topics increases, which agrees with our intuition, because using more latent topics leads to a better representation of the data. However, this trend is reversed in the case of FoH, which performs much better with a smaller number of latent topics. While a theoretical explanation of this remains unclear, in practice it is a good property of FoH to achieve high performance with simpler models. Fourth, 20 seems to be a reasonable number of latent semantic topics for this data set, since further increasing the number of topics does not result in a considerable improvement in performance.

Figure 8 shows the classification accuracies of the HH model implemented using different approximate inference methods. From the graph, we can see that the Langevin and contrastive divergence (CD) methods perform similarly, and both are slightly better than mean field (MF) and Gibbs sampling. We also study the efficiency of these inference methods by examining the time they need to reach convergence during training. The results show that mean field is the most efficient (approx. 2 min), followed by CD and Langevin (approx. 9 min), while the slowest is Gibbs sampling (approx. 49 min). Therefore, Langevin and CD are good choices for the learning and inference of our models in terms of both efficiency and classification performance.

6 Conclusion
We have described two bipartite undirected models for the semantic representation and classification of video data. The two models derive a latent semantic representation of video data by jointly modeling the textual and image features of the data, and perform classification based on this latent representation. Experiments on TRECVID data have demonstrated that our models achieve satisfactory performance on video classification and provide insights into the internal structure and relationships of video categories. Several approximate inference algorithms have been examined in terms of efficiency and classification performance.

Our hierarchical harmonium by nature does not restrict the number of categories an instance (shot) belongs to, since P(Y_t = 1|x, z) can be high for multiple Y_t. Therefore, an interesting direction for future work is to evaluate the model on a multi-label data set, where each instance can belong to any number of categories. In this case, our method is effectively a multi-task learning (MTL) method, and should be compared with other MTL approaches. Our models can also be improved by using better low-level features as input. The region-based color-histogram features are quite sensitive to scale and illumination variations; features such as local keypoint features are more robust and can be easily integrated into our models. It would be interesting to compare the latent semantic interpretations and classification performance using different features.

References

[1] D. M. Blei and M. I. Jordan. Modeling annotated data. In SIGIR '03: Proceedings of the 26th Annual Int'l ACM SIGIR Conf. on Research and Development in Information Retrieval, pages 127–134, New York, NY, USA, 2003. ACM Press.

[2] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[3] S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990.

[4] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.

[5] T. Hofmann. Probabilistic latent semantic analysis. In Proc. of Uncertainty in Artificial Intelligence, UAI'99, Stockholm, 1999.

[6] G. Iyengar and H. J. Nock. Discriminative model fusion for semantic concept detection and annotation in video. In Proc. of the 11th ACM Int'l Conf. on Multimedia, pages 255–258, New York, NY, USA, 2003. ACM Press.

[7] M. I. Jordan. Learning in Graphical Models: Foundations of Neural Computation. The MIT Press, 1998.

[8] F.-F. Li and P. Perona. A Bayesian hierarchical model for learning natural scene categories. In Proc. of the 2005 IEEE Computer Society Conf. on Computer Vision and Pattern Recognition (CVPR), pages 524–531, Washington, DC, USA, 2005. IEEE Computer Society.

[9] I. Murray and Z. Ghahramani. Bayesian learning in undirected graphical models: Approximate MCMC algorithms. In M. Chickering and J. Halpern, editors, Proceedings of the 20th Annual Conf. on Uncertainty in Artificial Intelligence (UAI-04), pages 392–399. AUAI Press, 2004.

[10] Y. Rui, R. Jain, N. D. Georganas, H. Zhang, K. Nahrstedt, J. Smith, and M. Kankanhalli. What is the state of our community? In Proc. of the 13th Annual ACM Int'l Conf. on Multimedia, pages 666–668, New York, NY, USA, 2005. ACM Press.

[11] A. Smeaton and P. Over. TRECVID: Benchmarking the effectiveness of information retrieval tasks on digital video. In Proc. of the Int'l Conf. on Image and Video Retrieval, 2003.

[12] C. Snoek, M. Worring, and A. Smeulders. Early versus late fusion in semantic video analysis. In Proc. of the 13th Annual ACM Int'l Conf. on Multimedia, 2005.

[13] M. Welling and G. E. Hinton. A new learning algorithm for mean field Boltzmann machines. In ICANN '02: Proceedings of the Int'l Conf. on Artificial Neural Networks, pages 351–357, London, UK, 2002. Springer-Verlag.

[14] M. Welling, M. Rosen-Zvi, and G. Hinton. Exponential family harmoniums with an application to information retrieval. In Advances in Neural Information Processing Systems 17, pages 1481–1488, Cambridge, MA, 2004. MIT Press.

[15] Y. Wu, E. Y. Chang, K. C.-C. Chang, and J. R. Smith. Optimal multimodal fusion for multimedia data analysis. In Proc. of the 12th Annual ACM Int'l Conf. on Multimedia, pages 572–579, 2004.

[16] E. Xing, M. Jordan, and S. Russell. A generalized mean field algorithm for variational inference in exponential families. In Uncertainty in Artificial Intelligence (UAI 2003). Morgan Kaufmann Publishers, 2003.

[17] E. Xing, R. Yan, and A. Hauptmann. Mining associated text and images with dual-wing harmoniums. In Proceedings of the 21st Annual Conf. on Uncertainty in Artificial Intelligence (UAI-05). AUAI Press, 2005.

[18] R. Yan, J. Yang, and A. G. Hauptmann. Learning query-class dependent weights in automatic video retrieval. In Proc. of the 12th ACM Int'l Conf. on Multimedia, pages 548–555. ACM Press, 2004.

APPENDIX
This appendix shows the derivation of the harmonium random field (joint distribution) in the family-of-harmonium model. We start from the general form of the exponential-family harmonium [14] that has H as the latent topic variables and X and Z as two types of observed data variables. This harmonium random field has the exponential form:

p(x, z, h) ∝ exp{ ∑_{ia} θ_{ia} f_{ia}(x_i) + ∑_{jb} η_{jb} g_{jb}(z_j) + ∑_{kc} λ_{kc} e_{kc}(h_k) + ∑_{ikac} W^{kc}_{ia} f_{ia}(x_i) e_{kc}(h_k) + ∑_{jkbc} U^{kc}_{jb} g_{jb}(z_j) e_{kc}(h_k) }

where {f_{ia}(·)}, {g_{jb}(·)}, and {e_{kc}(·)} denote the sufficient statistics (features) of the variables x_i, z_j, and h_k, respectively.

The marginal distributions, say p(x, z), are then obtained by integrating out the variables h:

p(x, z) = ∫_h p(x, z, h) dh
        ∝ exp{ ∑_{ia} θ_{ia} f_{ia}(x_i) + ∑_{jb} η_{jb} g_{jb}(z_j) } ∏_k ∫_{h_k} exp{ ∑_c (λ_{kc} + ∑_{ia} W^{kc}_{ia} f_{ia}(x_i) + ∑_{jb} U^{kc}_{jb} g_{jb}(z_j)) e_{kc}(h_k) } dh_k
        = exp{ ∑_{ia} θ_{ia} f_{ia}(x_i) + ∑_{jb} η_{jb} g_{jb}(z_j) + ∑_k C_k({λ̂_{kc}}) }

and similarly we can derive:

p(x, h) ∝ exp{ ∑_{ia} θ_{ia} f_{ia}(x_i) + ∑_{kc} λ_{kc} e_{kc}(h_k) + ∑_j B_j({η̂_{jb}}) }
p(z, h) ∝ exp{ ∑_{jb} η_{jb} g_{jb}(z_j) + ∑_{kc} λ_{kc} e_{kc}(h_k) + ∑_i A_i({θ̂_{ia}}) }

where the shifted parameters θ̂_{ia}, η̂_{jb}, and λ̂_{kc} are defined as:

θ̂_{ia} = θ_{ia} + ∑_{kc} W^{kc}_{ia} e_{kc}(h_k),   η̂_{jb} = η_{jb} + ∑_{kc} U^{kc}_{jb} e_{kc}(h_k),
λ̂_{kc} = λ_{kc} + ∑_{ia} W^{kc}_{ia} f_{ia}(x_i) + ∑_{jb} U^{kc}_{jb} g_{jb}(z_j)

The functions A_i(·), B_j(·), and C_k(·) are the log-partition functions:

A_i({θ̂_{ia}}) = log ∫_{x_i} exp{ ∑_a θ̂_{ia} f_{ia}(x_i) } dx_i
B_j({η̂_{jb}}) = log ∫_{z_j} exp{ ∑_b η̂_{jb} g_{jb}(z_j) } dz_j
C_k({λ̂_{kc}}) = log ∫_{h_k} exp{ ∑_c λ̂_{kc} e_{kc}(h_k) } dh_k

Further integrating out variables from these distributions gives the marginal distributions of x, z, and h:

p(x) ∝ exp{ ∑_{ia} θ_{ia} f_{ia}(x_i) + ∑_j B_j({η̂_{jb}}) + ∑_k C_k({λ̂_{kc}}) }
p(z) ∝ exp{ ∑_{jb} η_{jb} g_{jb}(z_j) + ∑_i A_i({θ̂_{ia}}) + ∑_k C_k({λ̂_{kc}}) }
p(h) ∝ exp{ ∑_{kc} λ_{kc} e_{kc}(h_k) + ∑_i A_i({θ̂_{ia}}) + ∑_j B_j({η̂_{jb}}) }

With all the above marginal distributions, we are ready to derive the conditional distributions:

p(x|h) = p(x, h)/p(h) ∝ ∏_i exp{ ∑_a θ̂_{ia} f_{ia}(x_i) − A_i({θ̂_{ia}}) }
p(z|h) = p(z, h)/p(h) ∝ ∏_j exp{ ∑_b η̂_{jb} g_{jb}(z_j) − B_j({η̂_{jb}}) }
p(h|x, z) = p(x, z, h)/p(x, z) ∝ ∏_k exp{ ∑_c λ̂_{kc} e_{kc}(h_k) − C_k({λ̂_{kc}}) }

The specific conditional distributions of x, z, and h defined in Eq.(3.2), (3.3), and (3.4) are all exponential-family distributions. They can be mapped to the general forms above with the following definitions:

f_{i1}(x_i) = x_i
θ_{i1} = α_i,   θ̂_{i1} = α_i + ∑_k W_{ik} h_k
g_{j1}(z_j) = z_j,   g_{j2}(z_j) = z_j²
η_{j1} = β_j,   η_{j2} = −1/2,   η̂_{j1} = β_j + ∑_k U_{jk} h_k
e_{k1}(h_k) = h_k,   e_{k2}(h_k) = h_k²
λ_{k1} = 0,   λ_{k2} = −1/2,   λ̂_{k1} = ∑_i W_{ik} x_i + ∑_j U_{jk} z_j

Therefore, plugging these definitions into the general form of the harmonium random field at the beginning of this appendix yields the specific random field:

p(x, z, h) ∝ exp{ ∑_i α_i x_i + ∑_j β_j z_j − ∑_j z_j²/2 − ∑_k h_k²/2 + ∑_{ik} W_{ik} x_i h_k + ∑_{jk} U_{jk} z_j h_k }

which is exactly the same as Eq.(3.5), except that the latter is defined for a specific category.