
Soft Computing
https://doi.org/10.1007/s00500-018-3244-4

METHODOLOGIES AND APPLICATION

Bayesian inference by reversible jump MCMC for clustering based on finite generalized inverted Dirichlet mixtures

Sami Bourouis1 · Faisal R. Al-Osaimi2 · Nizar Bouguila3 · Hassen Sallay4 · Fahd Aldosari4 · Mohamed Al Mashrgy3

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Abstract
The goal of constructing models from examples has been approached from different perspectives. Statistical methods have been widely used and proved effective in generating accurate models. Finite Gaussian mixture models have been widely used to describe a wide variety of random phenomena and have played a prominent role in many attempts to develop expressive statistical models in machine learning. However, their effectiveness is limited to applications where the underlying modeling assumptions (e.g., that the per-component densities are Gaussian) are reasonably satisfied. Thus, much research effort has been devoted to developing better alternatives. In this paper, we focus on constructing statistical models from positive vectors (i.e., vectors whose elements are strictly greater than zero), for which the generalized inverted Dirichlet (GID) mixture has been shown to be a flexible and powerful parametric framework. In particular, we propose a Bayesian density estimation method based upon mixtures of GIDs. The consideration of Bayesian learning is interesting in several respects. It allows uncertainty to be taken into account by introducing prior information about the parameters, it allows simultaneous parameter estimation and model selection, and it allows learning problems related to over- or under-fitting to be overcome. Indeed, we develop a reversible jump Markov Chain Monte Carlo sampler for GID mixtures that we apply for simultaneous clustering and feature selection in the context of some challenging real-world applications concerning scene classification, action recognition, and video forgery detection.

Keywords Finite mixtures · Generalized inverted Dirichlet · Bayesian inference · RJMCMC · Gibbs sampling · Scene classification · Action recognition · Video forgery

Communicated by V. Loia.

✉ Nizar Bouguila
[email protected]

Sami Bourouis
[email protected]

Faisal R. Al-Osaimi
[email protected]

Hassen Sallay
[email protected]

Fahd Aldosari
[email protected]

Mohamed Al Mashrgy
[email protected]

1 Department of Information Technology, College of Computers and Information Technology, Taif University, Taif, Kingdom of Saudi Arabia

2 Department of Computer Engineering, College of Computer Systems, Umm Al-Qura University, Mecca, Kingdom of Saudi Arabia

3 The Concordia Institute for Information Systems Engineering (CIISE), Concordia University, Montreal, QC, Canada

4 College of Computer and Information Systems, Umm Al-Qura University, Mecca, Kingdom of Saudi Arabia

1 Introduction

In recent years, mobile phone cameras have become popular, which has dramatically increased the amount of generated images and videos. Consequently, there is an urgent demand for developing approaches and models for annotation, indexing, and retrieval. In this context, many clustering approaches have been proposed (Zahn 1971; Ren et al. 2012; Gokcay and Príncipe 2002; Kanungo et al. 2002; Xu et al. 2013; Zhao and Zhang 2011; Mamat et al. 2013).



Data clustering (Meila 2007) is useful for discovering groups and finds its way into many applications from various domains such as pattern recognition, computer vision, data mining, remote sensing, and bioinformatics (Pizzuti and Talia 2003; Guha et al. 1998; Das and Konar 2009; Bong and Rajeswari 2011; Mishra et al. 2012). Many clustering algorithms have been proposed in the past. These algorithms can be roughly classified into: (1) algorithms based on partition (e.g., K-means, K-medoids), (2) algorithms based on hierarchy (e.g., BIRCH), (3) algorithms based on fuzzy theory (e.g., fuzzy C-means), (4) algorithms based on distribution (e.g., mixture models), (5) algorithms based on density (e.g., DBSCAN), (6) algorithms based on graph theory (e.g., CLICK), (7) algorithms based on grid (e.g., CLIQUE), (8) algorithms based on fractal theory (e.g., FC), and (9) algorithms based on model (e.g., COBWEB) (Xu and Tian 2015). For more details about these algorithms, the reader is referred to Xu and Tian (2015) and references therein. In this paper, we focus on distribution-based (or parametric) approaches. Among the most widely used parametric approaches, finite mixture models have aroused considerable interest in the last few years from both theoretical and practical points of view, with hundreds of applications such as analysis of network traffic (Hajji 2005), gene expression analysis (McLachlan and Khan 2004), image database summarization (Bouguila 2007), and image and video segmentation (Allili et al. 2007). In particular, there is a significant body of work on finite Gaussian mixture models. Although the Gaussian assumption underlies most mixture-based approaches (Vlassis et al. 1999; Vlassis and Likas 1999), several recent research works have shown that this choice is not appropriate in several applications where the data partitions are clearly non-Gaussian. The increasing interest in clustering non-Gaussian data has led to a new emphasis on designing more efficient and effective new mixture models. In this context, the work in this paper builds on recent research results that have shown that the GID, which includes the inverted Dirichlet (Bdiri and Bouguila 2012) as a special case, is efficient for the modeling of positive data for which the Gaussian mixture is not an appropriate choice (Mashrgy et al. 2014; Bourouis et al. 2014). Indeed, the main goal of this paper is to propose a unified Bayesian framework that tackles simultaneously the estimation and selection of finite GID mixtures while performing unsupervised feature selection.

The main research problems when considering finite mixtures are estimating the mixture's parameters and estimating the number of mixture components. Concerning the estimation of the parameters, two main groups of approaches can be considered, namely frequentist (e.g., maximum likelihood) and Bayesian techniques. The EM algorithm is the main approach to obtain maximum likelihood estimates for finite mixture distributions and was used in Mashrgy et al. (2014) in the case of the generalized inverted Dirichlet, but it suffers from slow convergence, dependency on initialization,

and possible convergence to local optima or saddle points. Bayesian inference has been widely adopted in many applications from different disciplines to overcome these problems (Lin and Lee 2007; Heitz and Koller 2008; Neal 2003; Rufo et al. 2006). Bayesian modeling is a mainstay in applied statistics since it provides a natural way to integrate prior information (Baldi and Long 2001). It was applied successfully in the case of the GID mixture in Bourouis et al. (2014), where an MCMC algorithm was developed using the fact that the GID belongs to the exponential family of distributions (Geiger et al. 2001) to develop priors for its parameters. The application of MCMC algorithms to learning statistical models in general (Tu and Zhu 2002; Chib and Winkelmann 2001) and mixture models in particular (Cabral et al. 2008) has produced interesting results. MCMC generally improves the ability to locate the global maximum of the likelihood function as compared to the EM algorithm and its different extensions (Dias and Wedel 2004). Concerning the estimation of the number of components, several selection criteria have been proposed, such as the Bayesian information criterion (BIC) suggested in Schwarz (1978) and applied in several Bayesian frameworks as an efficient approximation to the marginal likelihood (Bouguila et al. 2009; Bouguila 2011). A major drawback of this approach is that it requires the evaluation of a given selection criterion for different numbers of components, which is time consuming. A better approach, which we shall develop in this paper for simultaneous estimation of the parameters and the number of components, is RJMCMC, which generalizes traditional MCMC to the case where the dimension of the unknown parameters in the model is also unknown. RJMCMC can be viewed as a Bayesian model averaging procedure that selects the model automatically by producing the posterior probability of the number of components, upon which one draws a conclusion on how many components are needed to model the data (Richardson and Green 1997). An RJMCMC approach was proposed in Richardson and Green (1997) to learn Gaussian mixture models. Yet, to the best of our knowledge, RJMCMC inference of GID mixtures has not been done before.

RJMCMC has been applied to several challenging real problems (Wang and Zhu 2004; Ho and Hu 2008). The development of RJMCMC methods has made it possible to fit adequately large classes of statistical models. A challenging problem when using RJMCMC in real-life applications is the high dimensionality of the data. Generally, statistical learning from high-dimensional data is a difficult problem (Hinton 1999; Cohen and Richman 2002; Bickel and Levina 2004; Ruta and Porikli 2012; Bouveyron and Brunet 2012). To circumvent this difficulty, several dimensionality reduction approaches have been proposed (McLachlan et al. 2003). In particular, feature selection techniques have received a lot of attention. Although the majority of these approaches have been developed in supervised settings by considering labeled


instances to help remove noisy and redundant features, many interesting unsupervised feature selection techniques have been proposed (Law et al. 2004; Cai et al. 2010; He et al. 2011; Chen 2014). In recent years, the unsupervised feature selection approach proposed in Law et al. (2004) has attracted attention as a principled, flexible method when using mixture models. The key idea is to suppose that a given feature is irrelevant if its distribution follows a common density across classes and is thus independent of the class labels, which translates into an elegant formalization of the feature selection problem. Thus, we shall integrate it in our RJMCMC-based Bayesian framework. The resulting statistical framework is validated using challenging real-world applications concerning scene classification, action recognition, and video forgery detection.

The rest of this paper is organized as follows. In Sect. 2, we review the GID mixture model and illustrate how to integrate feature selection within it. Section 3 is dedicated to the development of our RJMCMC algorithm. Section 4 presents our validation via several interesting real applications. Some concluding remarks and ideas for further work are finally given in Sect. 5.

2 The mixture model

Mixture models are a well-established approach to unsupervised learning for complex applications involving data defined in high-dimensional heterogeneous (non-homogeneous) spaces. In this section, we briefly describe the GID mixture and the unsupervised feature selection approach based on this mixture model. Although this paper is self-contained, we refer the interested reader to Mashrgy et al. (2014) for detailed discussions and analysis. It is noteworthy that all the notations used in this paper are summarized in Table 1.

Let Y = {Y_1, …, Y_N} be a data set containing N objects, where Y_i is a D-dimensional positive vector describing the ith object that needs to be assigned to one of the M groups in the data set. A good generative model to represent such a data set is the GID mixture (Bourouis et al. 2014):

p(\mathbf{Y}_i|\Theta) = \sum_{j=1}^{M} p_j \, p(\mathbf{Y}_i|\theta_j) \qquad (1)

where Θ = (θ_1, θ_2, …, θ_M, p_1, p_2, …, p_M), (p_1, p_2, …, p_M) is the vector of mixing weights, which must be positive and sum to one, and p(Y_i|θ_j) is the GID distribution with parameters θ_j:

p(\mathbf{Y}_i|\theta_j) = \prod_{d=1}^{D} \frac{\Gamma(\alpha_{jd}+\beta_{jd})}{\Gamma(\alpha_{jd})\,\Gamma(\beta_{jd})} \, \frac{Y_{id}^{\alpha_{jd}-1}}{\left(1+\sum_{l=1}^{d} Y_{il}\right)^{\gamma_{jd}}} \qquad (2)

where θ_j = (α_{j1}, β_{j1}, …, α_{jD}, β_{jD}) and γ_{jd} = β_{jd} + α_{jd} − β_{j,d+1} for d = 1, …, D, with β_{j,D+1} = 0.

The GID has an interesting property, previously shown in Mashrgy et al. (2014). Indeed, if a vector Y_i has a GID with parameters (α_1, β_1, …, α_D, β_D), then we can construct a vector X_i using the following geometric transformation: X_{i1} = Y_{i1} and

X_{il} = \frac{Y_{il}}{1+\sum_{k=1}^{l-1} Y_{ik}} \quad \text{for } l > 1,

such that each X_{id} has

an inverted Beta distribution with parameters (α_d, β_d):

p_{\mathrm{IBeta}}(X_{id}|\alpha_d,\beta_d) = \frac{\Gamma(\alpha_d+\beta_d)}{\Gamma(\alpha_d)\,\Gamma(\beta_d)} \, \frac{X_{id}^{\alpha_d-1}}{(1+X_{id})^{\alpha_d+\beta_d}} \qquad (3)

This property means, as shown in Mashrgy et al. (2014), that the GID can be transformed into a multidimensional inverted Beta mixture model with conditionally independent features:

p(\mathbf{X}_i|\Theta) = \sum_{j=1}^{M} p_j \prod_{d=1}^{D} p_{\mathrm{IBeta}}(X_{id}|\alpha_{jd},\beta_{jd}) \qquad (4)
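To make the transformation concrete, the following minimal sketch (in Python with NumPy; the function name is ours, not from the paper) maps a GID-distributed positive vector Y_i to the vector X_i with conditionally independent inverted Beta coordinates:

```python
import numpy as np

def gid_to_inverted_beta(Y):
    """Geometric transformation from the text: X[0] = Y[0] and
    X[l] = Y[l] / (1 + Y[0] + ... + Y[l-1]) for l > 0."""
    Y = np.asarray(Y, dtype=float)
    # denominators: 1, 1 + Y[0], 1 + Y[0] + Y[1], ...
    denom = 1.0 + np.concatenate(([0.0], np.cumsum(Y)[:-1]))
    return Y / denom

# example: a 3-dimensional positive vector
X = gid_to_inverted_beta([0.5, 1.2, 0.3])
```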

The mean and the variance of the inverted Beta distribution are as follows:

\mu_{jd} = \frac{\alpha_{jd}}{\beta_{jd}-1} \qquad (5)

\sigma^2_{jd} = \frac{\alpha_{jd}(\alpha_{jd}+\beta_{jd}-1)}{(\beta_{jd}-2)(\beta_{jd}-1)^2} \qquad (6)

Using Eqs. 5 and 6, the parameters α_{jd} and β_{jd} of the inverted Beta distribution can be written with respect to the mean and the variance as follows:

\alpha_{jd} = \frac{\mu_{jd}^2(1+\mu_{jd}) + \mu_{jd}\,\sigma^2_{jd}}{\sigma^2_{jd}} \qquad (7)

\beta_{jd} = \frac{\mu_{jd}(1+\mu_{jd}) + 2\sigma^2_{jd}}{\sigma^2_{jd}} \qquad (8)

Then, the probability density function of the inverted Beta, as a function of its mean and variance, can be written as follows:

p_{\mathrm{IBeta}}(X_{id}|\mu_{jd},\sigma^2_{jd}) = \frac{X_{id}^{\left(\frac{\mu_{jd}^2(1+\mu_{jd})+\mu_{jd}\sigma^2_{jd}}{\sigma^2_{jd}}-1\right)}}{B\!\left(\frac{\mu_{jd}^2(1+\mu_{jd})+\mu_{jd}\sigma^2_{jd}}{\sigma^2_{jd}},\; \frac{\mu_{jd}(1+\mu_{jd})+2\sigma^2_{jd}}{\sigma^2_{jd}}\right)} \times \left(1+X_{id}\right)^{-\left(\frac{\mu_{jd}^2(1+\mu_{jd})+\mu_{jd}\sigma^2_{jd}+\mu_{jd}(1+\mu_{jd})+2\sigma^2_{jd}}{\sigma^2_{jd}}\right)}

where B is the beta function, defined as B(a, b) = Γ(a)Γ(b)/Γ(a + b).
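As a sanity check on Eqs. 5–8, the following minimal sketch (ours, not from the paper) recovers the shape parameters from the moments and evaluates the inverted Beta log-density of Eq. 3 under this mean/variance parametrization:

```python
import numpy as np
from scipy.special import betaln

def moments_to_shapes(mu, var):
    """Invert Eqs. 5-6: recover (alpha, beta) from the mean and variance
    (Eqs. 7-8); the recovered beta always exceeds 2, matching the
    existence condition of the variance in Eq. 6."""
    alpha = (mu**2 * (1.0 + mu) + mu * var) / var
    beta = (mu * (1.0 + mu) + 2.0 * var) / var
    return alpha, beta

def inverted_beta_logpdf(x, mu, var):
    """Log of the inverted Beta density (Eq. 3), parametrized by
    mean and variance."""
    a, b = moments_to_shapes(mu, var)
    return (a - 1.0) * np.log(x) - (a + b) * np.log1p(x) - betaln(a, b)
```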


Table 1 Summary of the main notations used in the paper

Symbol — Meaning
Y — Objects set
N — Number of objects in Y
D — Dimensionality of the positive vectors
Y_i — D-dimensional positive vector
Θ — Parameters of the mixture model
p_j — Mixing weight
M — Number of mixture components
θ_j = (α_{j1}, β_{j1}, …, α_{jD}, β_{jD}) — Parameters of the GID representing component j
X_i — D-dimensional vector obtained after the geometric transformation
μ_{jd} — Mean of the inverted Beta distribution
σ²_{jd} — Variance of the inverted Beta distribution
ρ_d — Weight of feature d
λ_d — Parameters of the common inverted Beta
Θ* — Unsupervised feature selection model parameters
Z_i — Membership vector
ε_{jd} — Location for the inverted Beta prior
ζ_{jd} — Shape for the inverted Beta prior
ϑ — Shape parameter of the inverse Gamma prior
ς — Scale parameter of the inverse Gamma prior
δ — Parameters of the Dirichlet prior
(ϕ, ϱ) and (λ, ν) — Parameters of the inverse Gamma hyperpriors
φ — Parameter of the exponential hyperprior

Statistical models often assume that all features are equally important. In reality this is not the case, since different features may have different saliencies (Tan 1993; Mashrgy et al. 2014). Some may even compromise the clustering process by limiting generalization capabilities. Finding a proper tradeoff between complexity and goodness of fit is crucial in several applications. This can be done via feature selection, which aims principally to reduce the dimensionality of the data by removing irrelevant features. The mathematical properties of the GID mixture offer the potential to model the density of high-dimensional vectors adequately while also allowing both clustering and feature selection. An interesting unsupervised feature selection formulation was previously applied successfully to the GID mixture in Mashrgy et al. (2014), and we shall adopt it here within Bayesian settings in our RJMCMC learning framework. The main idea is to suppose that a given feature X_{id} is generated from a mixture of two univariate distributions. The first one is assumed to generate relevant features and is different for each cluster, and the second one is common to all clusters (i.e., independent of the class labels) and assumed to generate irrelevant features. In our case, this idea can be formulated as follows:

p(\mathbf{X}_i|\Theta^*) = \sum_{j=1}^{M} p_j \prod_{d=1}^{D} \left[\rho_d \, p_{\mathrm{IBeta}}(X_{id}|\theta_{jd}) + (1-\rho_d)\, p_{\mathrm{IBeta}}(X_{id}|\lambda_d)\right] \qquad (9)

where Θ* = {{p_j}, {θ_{jd}}, {ρ_d}, {λ_d}} is the set of all our unsupervised feature selection model parameters that should be estimated, θ_{jd} = (μ_{jd}, σ²_{jd}), ρ_d represents the probability that feature X_{id} is relevant for clustering, and p_{IBeta}(X_{id}|λ_d) is an inverted Beta distribution with parameters λ_d = (μ_{λ|d}, σ²_{λ|d}), common to all clusters and supposed to generate the irrelevant features.
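Once the per-cluster parameters θ_{jd}, the common parameters λ_d, and the saliencies ρ_d are available, the density of Eq. 9 can be evaluated directly. A sketch (ours; it reuses `inverted_beta_logpdf` from the previous snippet, and the array shapes are our assumptions):

```python
import numpy as np
from scipy.special import logsumexp

def saliency_loglik(X, p, mu, var, rho, mu_irr, var_irr):
    """Log-likelihood of Eq. 9.
    X: (N, D) transformed data; p: (M,) mixing weights;
    mu, var: (M, D) relevant inverted Beta parameters;
    rho: (D,) feature relevance weights;
    mu_irr, var_irr: (D,) common (irrelevant) parameters."""
    rel = inverted_beta_logpdf(X[:, None, :], mu, var)          # (N, M, D)
    irr = inverted_beta_logpdf(X, mu_irr, var_irr)[:, None, :]  # (N, 1, D)
    mix = np.logaddexp(np.log(rho) + rel, np.log1p(-rho) + irr) # (N, M, D)
    return logsumexp(np.log(p) + mix.sum(axis=2), axis=1).sum()
```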

3 RJMCMC-based learning

The unsupervised feature selection model formalized by Eq. 9 was learned in Mashrgy et al. (2014) via an expectation-maximization (EM) algorithm used to minimize a message length objective that allows simultaneously the selection of the optimal number of components and of the relevant features. Several studies have shown, however, that Bayesian learning


techniques offer better generalization capabilities thanks to the introduction of prior information about the parameters to learn. Thus, the goal here is to propose a Bayesian alternative to the approach proposed in Mashrgy et al. (2014) using RJMCMC inference, which has been extensively studied and used by several researchers (Zhang et al. 2004; Jasra et al. 2007), especially when we have multiple parameter subspaces, of different dimensionalities, for which it is necessary to devise different move types.

3.1 Bayesian model

Unlike classic mixture model formulations, the number of components M is considered here as a parameter of the model for which a conditional distribution should be found. Thus, our unknowns are M, θ = (θ_1, …, θ_M), such that θ_j = (θ_{j1}, …, θ_{jD}), and P = (p_1, …, p_M), and they are regarded as random variables drawn from some prior distributions that we have to specify. The joint distribution of all these variables is

p(M,\mathbf{P},Z,\theta,\mathbf{Y}) = p(M)\,p(\mathbf{P}|M)\,p(Z|\mathbf{P},M)\,p(\theta|Z,\mathbf{P},M)\,p(\mathbf{Y}|\theta,Z,\mathbf{P},M)

where Z = {Z_1, …, Z_N}, and each Z_i is an M-dimensional membership vector (known also as the unobserved or missing vector) that indicates to which component Y_i belongs, such that Z_{ij} equals 1 if Y_i belongs to class j and 0 otherwise. It is noteworthy that the Z_{ij} are supposed to be drawn independently from the following distribution:

p(Z_{ij} = 1) = p_j, \quad j = 1, \ldots, M \qquad (10)

By imposing common conditional independencies (see, for instance, Richardson and Green 1997), we get the following joint distribution:

p(M,\mathbf{P},Z,\theta,\mathbf{Y}) = p(M)\,p(\mathbf{P}|M)\,p(Z|\mathbf{P},M)\,p(\theta|M)\,p(\mathbf{Y}|\theta,Z) \qquad (11)

The goal of Bayesian inference is to generate realizations from the conditional joint density p(M, P, Z, θ|Y). Part of the difficulty of Bayesian learning concerns the selection of priors, which reflect our degree of belief about the parameter values. This is mainly due to the fact that a formal approach to selecting these priors does not exist. For our model we have considered some common choices. For instance, we have chosen inverted Beta and inverse Gamma distributions as priors for the mean μ_{jd} and the variance σ²_{jd}, respectively:

p(\mu_j|\varepsilon,\zeta) = \prod_{d=1}^{D} \frac{1}{B\!\left(\frac{\varepsilon^2(1+\varepsilon)+\varepsilon\zeta}{\zeta},\; \frac{\varepsilon(1+\varepsilon)+2\zeta}{\zeta}\right)} \, \mu_{jd}^{\left(\frac{\varepsilon^2(1+\varepsilon)+\varepsilon\zeta}{\zeta}-1\right)} \left(1+\mu_{jd}\right)^{-\left(\frac{\varepsilon^2(1+\varepsilon)+\varepsilon\zeta+\varepsilon(1+\varepsilon)+2\zeta}{\zeta}\right)} \qquad (12)

where μ_j = (μ_{j1}, …, μ_{jD}), ε_{jd} is the location and ζ_{jd} is the shape parameter for the inverted Beta distribution. A common choice as a prior for the variances σ²_j = (σ²_{j1}, …, σ²_{jD}) is the inverse Gamma distribution; then

p(\sigma^2_j|\vartheta,\varsigma) \sim \prod_{d=1}^{D} \frac{\varsigma^{\vartheta}}{\Gamma(\vartheta)} \left(\sigma^2_{jd}\right)^{-\vartheta-1} \exp\!\left(-\frac{\varsigma}{\sigma^2_{jd}}\right) \qquad (13)

where ϑ and ς represent the shape and scale parameters of the inverse Gamma distribution, respectively. Using Eqs. 12 and 13, we have

p(\theta|M,\tau) = \prod_{j=1}^{M} p(\mu_j|\varepsilon,\zeta)\, p(\sigma^2_j|\vartheta,\varsigma) \qquad (14)

where τ = (ε, ζ, ϑ, ς) are the hyperparameters of θ. Therefore, the full conditional posterior distributions for the mean μ_j and the variance σ²_j can be written as follows:

p(\mu_j|\ldots) \propto \prod_{j=1}^{M} p(\mu_j|\varepsilon,\zeta)\,p(\sigma^2_j|\vartheta,\varsigma) \prod_{i=1}^{N} p(\mathbf{X}_i|\theta_{Z_i}) \propto p(\mu_j|\varepsilon,\zeta) \prod_{i=1}^{N} p(\mathbf{X}_i|\theta_{Z_i}) \qquad (15)

p(\sigma^2_j|\ldots) \propto \prod_{j=1}^{M} p(\mu_j|\varepsilon,\zeta)\,p(\sigma^2_j|\vartheta,\varsigma) \prod_{i=1}^{N} p(\mathbf{X}_i|\theta_{Z_i}) \propto p(\sigma^2_j|\vartheta_j,\varsigma_j) \prod_{i=1}^{N} p(\mathbf{X}_i|\theta_{Z_i}) \qquad (16)

The notation | … is used to denote conditioning on all other variables. In addition, the typical prior choice for the mixing weights P is the Dirichlet distribution, since P is defined under the constraint on p_1, …, p_M that Σ_{j=1}^M p_j = 1. Then, the prior can be written as follows:

p(\mathbf{P}|M,\delta) = \frac{\Gamma\!\left(\sum_{j=1}^{M}\delta_j\right)}{\prod_{j=1}^{M}\Gamma(\delta_j)} \prod_{j=1}^{M} p_j^{\delta_j-1} \qquad (17)

Also, the prior of the membership variable Z is:

p(Z|\mathbf{P},M) = \prod_{j=1}^{M} p_j^{n_j} \qquad (18)


where n_j represents the number of vectors belonging to the jth cluster. Using Eqs. 17 and 18 we get

p(\mathbf{P}|\ldots) \propto p(Z|\mathbf{P},M)\,p(\mathbf{P}|M,\delta) \propto \prod_{j=1}^{M} p_j^{n_j} \, \frac{\Gamma\!\left(\sum_{j=1}^{M}\delta_j\right)}{\prod_{j=1}^{M}\Gamma(\delta_j)} \prod_{j=1}^{M} p_j^{\delta_j-1} \propto \prod_{j=1}^{M} p_j^{n_j+\delta_j-1} \qquad (19)

which is proportional to a Dirichlet distribution with parameters (δ_1 + n_1, …, δ_M + n_M). Besides, using Eq. 10 the membership variable posterior can be obtained as follows:

p(Z_i = j|\ldots) \propto p_j \prod_{d=1}^{D} p_{\mathrm{IBeta}}(X_{id}|\theta_{jd}) \qquad (20)
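Both full conditionals have standard forms, so the corresponding Gibbs draws are short. A minimal sketch (ours; it reuses `inverted_beta_logpdf` from the earlier snippet and assumes the data have already been transformed):

```python
import numpy as np

rng = np.random.default_rng(0)

def update_P(Z, M, delta):
    """Draw the mixing weights from the Dirichlet posterior of Eq. 19,
    with parameters (delta_1 + n_1, ..., delta_M + n_M)."""
    n = np.bincount(Z, minlength=M)              # occupancy counts n_j
    return rng.dirichlet(delta + n)

def update_Z(X, p, mu, var):
    """Draw allocations from the membership posterior of Eq. 20 by
    inverse-CDF sampling with one uniform per observation."""
    logpost = np.log(p) + inverted_beta_logpdf(
        X[:, None, :], mu, var).sum(axis=2)      # (N, M), unnormalized
    logpost -= logpost.max(axis=1, keepdims=True)
    prob = np.exp(logpost)
    prob /= prob.sum(axis=1, keepdims=True)
    u = rng.random((len(X), 1))
    return (prob.cumsum(axis=1) < u).sum(axis=1)
```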

Another hierarchical level can be introduced to represent the priors of the hyperparameters in the model. First, the hyperparameters ε and ζ, which are associated with μ_j, are given uniform and inverse Gamma priors, respectively:

p(\varepsilon) \sim U_{[a,b]} \qquad (21)

p(\zeta|\varphi,\varrho) \sim \frac{\varrho^{\varphi} \exp(-\varrho/\zeta)}{\Gamma(\varphi)\,\zeta^{\varphi+1}} \qquad (22)

where a = min{X_{il}, i = 1, …, N; l = 1, …, D} and b = max{X_{il}, i = 1, …, N; l = 1, …, D}. According to the two previous equations, the conditional posteriors for ε and ζ can be written as:

p(\varepsilon|\ldots) = p(\varepsilon) \prod_{j=1}^{M} p(\mu_j|\varepsilon,\zeta) \qquad (23)

p(\zeta|\ldots) = p(\zeta|\varphi,\varrho) \prod_{j=1}^{M} p(\mu_j|\varepsilon,\zeta) \qquad (24)

Also, the hyperparameters ϑ and ς, which are associated with the variances σ²_j, are given inverse Gamma and exponential priors, respectively:

p(\vartheta|\lambda,\nu) \sim \frac{\nu^{\lambda} \exp(-\nu/\vartheta)}{\Gamma(\lambda)\,\vartheta^{\lambda+1}} \qquad (25)

p(\varsigma|\phi) \sim \phi \exp(-\phi\varsigma) \qquad (26)

From these two previous equations, the conditional posteriors for ϑ and ς are written as:

p(\vartheta|\ldots) \propto p(\vartheta|\lambda,\nu) \prod_{j=1}^{M} p(\sigma^2_j|\vartheta,\varsigma) \qquad (27)

p(\varsigma|\ldots) \propto p(\varsigma|\phi) \prod_{j=1}^{M} p(\sigma^2_j|\vartheta,\varsigma) \qquad (28)

Specifying hyperpriors allows us to take into account uncertainty about the hyperparameter values rather than fixing them arbitrarily. Finally, for the number of components M, the common choice is a uniform distribution between 1 and a predefined integer Mmax.

3.2 RJMCMC methodology

The RJMCMC methodology developed in Richardson and Green (1997) enables the simultaneous estimation of the posterior probabilities of several models under consideration and of the parameters conditional on a specific model. It generalizes the classic Metropolis–Hastings (M–H) algorithm to the dimension-varying situation. It is designed such that the sampler moves across different dimensions according to six types of moves (see Algorithm 1). It is noteworthy that moves (1), (2), (3)

Algorithm 1 RJMCMC Moves
1: Update the mixing parameters P
2: Update the parameters μ_j and σ²_j
3: Update the allocation Z
4: Update the hyperparameters ϑ, ς, ε, ζ
5: Split one component into two, or combine two into one
6: The birth or death of an empty component

and (4) are usual parameter update moves via Gibbs sampling. On the other hand, (5) and (6) are trans-dimensional moves (they change the number of components by one); they constitute the reversible jump and are used for model selection via the M–H algorithm, which has been extensively studied and adopted in the context of many applications from different disciplines (Liu et al. 2000; Kato 2008). Each step t = 1, …, 6 is called a move, and a sweep is defined as a complete pass over the six moves. Assume that we are in state Δ_M, where Δ_M = (Z, P, M). The MCMC step representing move (5) takes the form of a Metropolis–Hastings step, proposing a move from a state Δ_M to Δ̂_M with target probability distribution (posterior distribution) p(Δ_M|Y) and proposal distribution q_t(Δ_M, Δ̂_M) for move t. When we are in the current state Δ_M, a given move t to destination Δ̂_M is accepted with probability

p_t(\Delta_M, \hat{\Delta}_M) = \min\!\left(1,\; \frac{p(\hat{\Delta}_M|\mathbf{Y})\, q_t(\hat{\Delta}_M, \Delta_M)}{p(\Delta_M|\mathbf{Y})\, q_t(\Delta_M, \hat{\Delta}_M)}\right) \qquad (29)

In the case of a move type where the dimension of the parameter does not change, we use an ordinary ratio of densities. A move from a point Δ_M to Δ̂_M in a higher-dimensional space is done by drawing a vector of continuous random variables u, independent of Δ_M; the new state Δ̂_M is then determined by an invertible deterministic function f(Δ_M, u) (Richardson and Green 1997). On the other hand,


the move from Δ̂_M to Δ_M can be carried out using the inverse transformation. Hence, the move acceptance probability is given by

p_t(\Delta_M, \hat{\Delta}_M) = \min\!\left(1,\; \frac{p(\hat{\Delta}_M|\mathbf{Y})\, r_m(\hat{\Delta}_M)}{p(\Delta_M|\mathbf{Y})\, r_m(\Delta_M)\, q(\mathbf{u})} \left|\frac{\partial \hat{\Delta}_M}{\partial(\Delta_M, \mathbf{u})}\right|\right) \qquad (30)

where r_m(Δ_M) is the probability of choosing move type m when we are in state Δ_M, and q(u) is the density function of u. The last term |∂Δ̂_M/∂(Δ_M, u)| is the Jacobian arising from the change of variables from (Δ_M, u) to Δ̂_M.

The first four steps of RJMCMC are based on simple Gibbs sampling, where the parameters are drawn from their known full conditional distributions. The first move draws the mixing weights from a Dirichlet distribution as shown in Eq. 19. The second move draws the mixture's parameters using Eqs. 12 and 13. According to these equations, the full conditional distributions are complex and do not have well-known forms; thus, we use the M–H algorithm (Liu et al. 2000). At sweep t, the mean μ_{jd} can be generated using the M–H algorithm (see Algorithm 2). The most important issue in the M–H algorithm

Algorithm 2 M–H algorithm
1: Generate μ̂_j ∼ q(μ_j | μ_j^(t−1)) and u ∼ U_[0,1]
2: Calculate r = [p(μ̂_j | …) q(μ_j^(t−1) | μ̂_j)] / [p(μ_j^(t−1) | …) q(μ̂_j | μ_j^(t−1))]
3: If u < r then μ_j^(t) = μ̂_j, else μ_j^(t) = μ_j^(t−1)

is choosing the candidate-generating density q (proposal distribution) so as to keep the means μ_{jd} ∈ [a, b]. A popular choice for q is a random walk, where the previously simulated value μ_{jd}^(t−1) is used to generate the following value μ̂_{jd}. We propose to generate the new mean μ_{jd}^(t) from an inverted Beta distribution (w.r.t. mean and variance), where its mean is the previous value μ_{jd}^(t−1) and its variance is a constant value C (we take empirically C = 2.5). The new value of the mean generated using the proposal distribution is

\hat{\mu}_{jd} \sim p_{\mathrm{IBeta}}\!\left(\mu_{jd}^{(t-1)}, C\right) \qquad (31)

For the variance σ²_{jd}^(t), the generation is performed using Algorithm 3, where the proposal distribution q is given by

\hat{\sigma}^2_{jd} \sim \mathcal{LN}\!\left(\sigma^{2\,(t-1)}_{jd}, e^2\right) \qquad (32)

where LN refers to the lognormal distribution with mean log(σ²_{jd}^(t−1)) and variance e².

Algorithm 3 σ²_{jd}^(t) generation
1: Generate σ̂²_{jd} ∼ q(σ²_{jd} | σ²_{jd}^(t−1)) and u ∼ U_[0,1]
2: Calculate r = [p(σ̂²_{jd} | …) q(σ²_{jd}^(t−1) | σ̂²_{jd})] / [p(σ²_{jd}^(t−1) | …) q(σ̂²_{jd} | σ²_{jd}^(t−1))]
3: If u < r then σ²_{jd}^(t) = σ̂²_{jd}, else σ²_{jd}^(t) = σ²_{jd}^(t−1)

The third move generates the missing data Z_i (1 ≤ i ≤ N) from simulated standard uniform random variables u_i: Z_i = j if p(Z_i = 1 | …) + ⋯ + p(Z_i = j − 1 | …) < u_i ≤ p(Z_i = 1 | …) + ⋯ + p(Z_i = j | …). Finally, Gibbs sampling is used to update the hyperparameters ε, ζ, ϑ, and ς given by Eqs. 23, 24, 27, and 28, respectively.
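In code, Algorithms 2 and 3 differ only in their proposals: an inverted Beta random walk for the mean (Eq. 31) and a lognormal random walk for the variance (Eq. 32). A hedged sketch under our own naming (it reuses `moments_to_shapes` and `inverted_beta_logpdf` from Sect. 2; `log_target` stands for the log full conditional of Eq. 15 or 16, and the value of e² is a tuning constant we pick arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_inverted_beta(mu, var):
    """Draw from an inverted Beta with given mean and variance using
    V ~ Beta(a, b)  =>  V / (1 - V) ~ IBeta(a, b)."""
    a, b = moments_to_shapes(mu, var)          # Eqs. 7-8
    v = rng.beta(a, b)
    return v / (1.0 - v)

def mh_step(current, cand, log_target, log_q_fwd, log_q_bwd):
    """One Metropolis-Hastings accept/reject (Algorithms 2 and 3)."""
    log_r = log_target(cand) - log_target(current) + log_q_bwd - log_q_fwd
    return cand if np.log(rng.random()) < log_r else current

C = 2.5  # proposal variance for the mean (value taken from the text)

def update_mean(mu_cur, log_target):
    """Eq. 31: inverted Beta random walk centered at the current mean."""
    cand = sample_inverted_beta(mu_cur, C)
    return mh_step(mu_cur, cand, log_target,
                   log_q_fwd=inverted_beta_logpdf(cand, mu_cur, C),
                   log_q_bwd=inverted_beta_logpdf(mu_cur, cand, C))

def update_var(s2_cur, log_target, e2=0.1):
    """Eq. 32: lognormal random walk; e2 is an assumed tuning value."""
    cand = rng.lognormal(np.log(s2_cur), np.sqrt(e2))
    # for a lognormal random walk, the proposal ratio is cand / current
    return mh_step(s2_cur, cand, log_target,
                   log_q_fwd=-np.log(cand), log_q_bwd=-np.log(s2_cur))
```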

In move (5), we make a random choice between attempting to split or to combine, with probabilities a_M and b_M, respectively, where b_M = 1 − a_M. Clearly, a_{Mmax} = 0 and b_1 = 0; otherwise we choose a_M = b_M = a_1 = b_{Mmax} = 0.5 for M = 2, …, Mmax − 1, where Mmax is the maximum value allowed for M. The combine move is constructed by randomly choosing a pair of components (j_1, j_2), which must be adjacent; in other words they must meet the following constraint: μ_{j_1} < μ_{j_2}, with no other μ_j in the interval [μ_{j_1}, μ_{j_2}]. Then, these two components can be merged and M is reduced by 1. We denote the newly formed component by j*, which contains all the observations that were allocated to j_1 and j_2. Finally, we generate the parameter values p_{j*}, μ_{j*}, σ²_{j*} of the new component by preserving the zeroth, first, and second moments, which are calculated as follows:

p_{j^*} = p_{j_1} + p_{j_2} \qquad (33)

p_{j^*}\mu_{j^*} = p_{j_1}\mu_{j_1} + p_{j_2}\mu_{j_2} \qquad (34)

p_{j^*}\!\left(\mu_{j^*}^2 + \sigma^2_{j^*}\right) = p_{j_1}\!\left(\mu_{j_1}^2 + \sigma^2_{j_1}\right) + p_{j_2}\!\left(\mu_{j_2}^2 + \sigma^2_{j_2}\right) \qquad (35)

For the split move, a component j* is chosen randomly and split into two components j_1 and j_2 with new parameters (p_{j_1}, μ_{j_1}, σ²_{j_1}) and (p_{j_2}, μ_{j_2}, σ²_{j_2}), respectively, which must satisfy Eqs. 33, 34, and 35. Since there are 3 degrees of freedom in achieving this, we need to generate, from Beta distributions, a three-dimensional random vector u = [u_1, u_2, u_3] to define the new parameters (Richardson and Green 1997), and we set

p_{j_1} = p_{j^*} u_1, \qquad p_{j_2} = p_{j^*}(1 - u_1) \qquad (36)

\mu_{j_1} = \mu_{j^*} - u_2\sqrt{\sigma^2_{j^*}\,\frac{p_{j_2}}{p_{j_1}}}, \qquad \mu_{j_2} = \mu_{j^*} + u_2\sqrt{\sigma^2_{j^*}\,\frac{p_{j_1}}{p_{j_2}}} \qquad (37)

\sigma^2_{j_1} = u_3(1-u_2^2)\,\sigma^2_{j^*}\,\frac{p_{j^*}}{p_{j_1}}, \qquad \sigma^2_{j_2} = (1-u_3)(1-u_2^2)\,\sigma^2_{j^*}\,\frac{p_{j^*}}{p_{j_2}} \qquad (38)
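For a single dimension, the combine map (Eqs. 33–35) and its inverse, the split map (Eqs. 36–38), are only a few lines each; the sketch below (ours) can be used to verify that one map inverts the other:

```python
import numpy as np

def combine(p1, mu1, s1, p2, mu2, s2):
    """Merge two components while preserving the zeroth, first, and
    second moments (Eqs. 33-35)."""
    p = p1 + p2
    mu = (p1 * mu1 + p2 * mu2) / p
    s = (p1 * (mu1**2 + s1) + p2 * (mu2**2 + s2)) / p - mu**2
    return p, mu, s

def split(p, mu, s, u1, u2, u3):
    """Split one component into two (Eqs. 36-38); u1, u2, u3 are the
    Beta-distributed degrees of freedom from the text."""
    p1, p2 = p * u1, p * (1.0 - u1)
    mu1 = mu - u2 * np.sqrt(s * p2 / p1)
    mu2 = mu + u2 * np.sqrt(s * p1 / p2)
    s1 = u3 * (1.0 - u2**2) * s * p / p1
    s2 = (1.0 - u3) * (1.0 - u2**2) * s * p / p2
    return p1, mu1, s1, p2, mu2, s2

# round trip: combining a split recovers the original component
parts = split(0.4, 1.5, 0.2, u1=0.3, u2=0.5, u3=0.6)
assert np.allclose(combine(*parts), (0.4, 1.5, 0.2))
```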

123

Page 8: Bayesian inference by reversible jump MCMC for clustering ...amansystem.com/apps/publications/papers/Bayesian... · Keywords Finite mixtures · Generalized inverted Dirichlet · Bayesian

S. Bourouis et al.

For the newly generated components, the adjacency condition defined in the combine move must be checked to ensure that the split/combine pair is reversible. If the condition is not met, the split move is rejected; otherwise, the split move is accepted and we reallocate the observations of j* to the new components j_1 and j_2 using Eq. 10. According to Eq. 30, the acceptance probability R for the split and combine move types can be calculated as follows:

R = \frac{p(Z,\mathbf{P},M+1,\varepsilon,\zeta,\varsigma,\vartheta|\mathbf{Y})\, b_{M+1}}{p(Z,\mathbf{P},M,\varepsilon,\zeta,\varsigma,\vartheta|\mathbf{Y})\, a_M\, P_{\mathrm{alloc}}\, q(\mathbf{u})} \left|\frac{\partial \hat{\Delta}_M}{\partial(\Delta_M, \mathbf{u})}\right| \qquad (39)

where the acceptance probability for the split move is min(1, R) and for the combine move min(1, R⁻¹). P_alloc is the probability of making this particular allocation to components j_1 and j_2, and |∂Δ̂_M/∂(Δ_M, u)| is the Jacobian of the transformation from the state (p_{j*}, μ_{j*}, σ²_{j*}, u_1, u_2, u_3) to the state (p_{j_1}, μ_{j_1}, σ²_{j_1}, p_{j_2}, μ_{j_2}, σ²_{j_2}).

In the death/birth move, we first make a random choice between birth and death with the same probabilities a_M and b_M as above. If the birth move is chosen, the values of the parameters of the new component (μ_{j*}, σ²_{j*}) are drawn from the associated prior distributions given by Eqs. 12 and 13, respectively. The weight of the new component, p_{j*}, is generated from the marginal distribution of p_{j*} derived from the distribution of P = (p_1, …, p_M, p_{j*}). The vector P follows a Dirichlet with parameters (δ_1, …, δ_M, δ_{j*}); thus the marginal of p_{j*} is a Beta distribution with parameters (δ_{j*}, Σ_{j=1}^M δ_j). Note that in order to keep the mixture constraint Σ_{j=1}^M p_j + p_{j*} = 1, the previous weights p_j, j = 1, …, M, have to be rescaled, each multiplied by (1 − p_{j*}). The acceptance probabilities for the birth and death moves are min{1, R} and min{1, R⁻¹}, respectively, where

R = \frac{p(M+1)}{p(M)}\, \frac{1}{B(\delta, M\delta)}\, p_{j^*}^{\delta-1} (1-p_{j^*})^{N+M\delta-M}\, \frac{(M+1)\, a_{M+1}}{M_0\, b_M}\, \frac{1}{p(p_{j^*})}\, (1-p_{j^*})^{M} \qquad (40)

where B is the beta function and M_0 is the number of empty components before the birth. The Jacobian corresponding to the birth move is then (1 − p_{j*})^M. For the opposite move, we randomly choose an existing empty component to delete; the remaining weights then have to be rescaled to keep the unit-sum constraint. The acceptance probabilities of the birth and death moves, min{1, R} and min{1, R⁻¹}, are calculated according to:

R = \frac{p(M+1)}{p(M)}\, \frac{\Gamma\!\left(\delta_{j^*} + \sum_{j=1}^{M}\delta_j\right)}{\Gamma(\delta_{j^*})\,\Gamma\!\left(\sum_{j=1}^{M}\delta_j\right)}\, p_{j^*}^{\delta_{j^*}-1} (M+1) \times (1-p_{j^*})^{N+\sum_{j=1}^{M}\delta_j - M}\, \frac{b_{M+1}}{a_M (M_0+1)}\, \frac{1}{p(p_{j^*})}\, (1-p_{j^*})^{M} \qquad (41)

where M_0 is the number of empty components before the birth. Finally, the parameters λ_d are generated using an M–H algorithm like the one explained above for the inverted Beta parameters of the relevant part of the model. Having the parameters of both parts of the model, the estimation of {ρ_d} becomes straightforward, taking into account the fact that we can easily calculate the probability that a given feature is relevant, which is given by ρ_d p_{IBeta}(X_{id}|θ_{jd}) / [ρ_d p_{IBeta}(X_{id}|θ_{jd}) + (1 − ρ_d) p_{IBeta}(X_{id}|λ_d)]. The complete algorithm is summarized in Algorithm 4.

Algorithm 4
1: Select some values for the hyperparameters.
2: Initialize randomly all parameters.
3: Repeat until convergence:
4:   Update the mixing parameters P using Eq. 19.
5:   Update the parameters μ_j and σ²_j using Algorithms 2 and 3, respectively.
6:   Update the allocation Z using Eq. 20.
7:   Update the feature weights using ρ_d = ρ_d p_{IBeta}(X_{id}|θ_{jd}) / [ρ_d p_{IBeta}(X_{id}|θ_{jd}) + (1 − ρ_d) p_{IBeta}(X_{id}|λ_d)].
8:   Update the hyperparameters ϑ, ς, ε, ζ using Eqs. 27, 28, 23, and 24, respectively.
9:   Split one component into two, or combine two into one, or perform the birth or death of an empty component, using the approach explained above.
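One piece of move (6) that is easy to get wrong is the weight bookkeeping in the birth move. A small sketch (ours; it assumes a common δ for all components, so that Σ_j δ_j = Mδ):

```python
import numpy as np

rng = np.random.default_rng(2)

def birth_weight(p, delta):
    """Draw the new empty component's weight from the Beta marginal of
    the Dirichlet, Beta(delta_j*, sum_j delta_j), then multiply the old
    weights by (1 - p_j*) to keep the unit-sum constraint."""
    w_new = rng.beta(delta, delta * len(p))
    return np.append(p * (1.0 - w_new), w_new)

p = birth_weight(np.array([0.5, 0.3, 0.2]), delta=1.0)
assert np.isclose(p.sum(), 1.0)
```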

4 Experimental results

In this section, we start by investigating the flexibility and efficiency of the GID mixture model without feature selection (BGIDM), as compared to the widely studied Gaussian mixture, on a challenging real-life application, namely video forgery detection. Then, we investigate our Bayesian GID clustering approach with feature selection, denoted BGIDM+FS, using two challenging real-world applications from the computer vision domain, namely visual scene classification and action recognition. Other comparable approaches have also been considered: the same model learned using an EM algorithm with (EMGIDM+FS) and without (EMGIDM) feature selection (Mashrgy et al. 2014), and the Gaussian mixture model learned using RJMCMC with (BGM+FS) and without (BGM) feature selection and learned using EM with (EMGM+FS) and without (EMGM) feature selection. Please note that the main goal of this section is to test our model against comparable approaches and not against the state-of-the-art


Fig. 1 Original and inpainted frames used to evaluate the detection approach

techniques related to the selected applications, which is obviously out of the scope of this paper. In these applications our specific empirical choices for the hyperparameters were η_1 = ⋯ = η_M = 1 and (ϕ, ϱ, λ, φ) = (2, 5, 0.2, 1). We have used 20 parallel chains and considered the criterion of Gelman and Rubin (Gelman and Rubin 1992) to detect convergence. In general, for M = 1, 2, the Gelman–Rubin scale reduction factor came down to 1 within 50 iterations; however, for M > 3, the scale reduction factor was slightly larger than 1 even after 2000 iterations. A burn-in period of 1000 iterations followed by 10,000 iterations was sufficient for convergence in these applications.
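The Gelman–Rubin criterion compares between-chain and within-chain variability. For reference, a compact sketch (ours, for scalar traces) of the potential scale reduction factor:

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for an (m, n) array of m parallel
    chains of length n (Gelman and Rubin 1992); values near 1 indicate
    convergence."""
    m, n = chains.shape
    W = chains.var(axis=1, ddof=1).mean()       # within-chain variance
    B = n * chains.mean(axis=1).var(ddof=1)     # between-chain variance
    V = (n - 1) / n * W + B / n                 # pooled variance estimate
    return np.sqrt(V / W)
```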


4.1 Video forgery detection

Video forgery detection has been an active research topic in the last few years (Wang and Farid 2007a, b; Hsu et al. 2008; Kobayashi et al. 2010). It can be viewed as the problem of reliably distinguishing between tampered videos and untampered original ones. Indeed, with the advent of sophisticated multimedia editing software that allows the manipulation of videos, the integrity of a given video's content cannot be taken for granted. This strong interest is driven by a wide spectrum of promising applications in many areas such as forensics and security (e.g., a video can contain hidden information) and journalism (e.g., using videos as critical evidence), to name a few. Unlike image forgery detection, the problem has received much attention only recently, and the development of new efficient methods is urgent. To solve such a problem, one may think of extending techniques used for image tampering detection to each frame of the treated video. Unfortunately, it will be difficult to detect some types of forgery because frames are evaluated independently and there is no consideration of the correlation between frames. Among the successful approaches, the authors in Hsu et al. (2008) proposed to exploit noise residue as a characteristic feature extracted from the video and to use a block-level correlation technique. The distribution of the correlation of temporal noise residue in a tampered video, in forged and normal regions, was supposed to be a Gaussian mixture model whose parameters are learned using the expectation-maximization algorithm. Consequently, a Bayesian classifier is used to find the optimal threshold value based on the estimated parameters. The goal of this application is to investigate the detection results when a GID mixture is considered instead, to motivate further this particular choice. In our experiments, we consider five video sequences (see Fig. 1) forged using the inpainting approach proposed in Mosleh et al. (2012).

We have evaluated the forgery detection performance using precision and recall rates, which we have calculated from the available ground truth:

\text{Precision} = \frac{N_{\text{hit}}}{N_{\text{hit}} + N_{\text{false}}} \qquad (42)

\text{Recall} = \frac{N_{\text{hit}}}{N_{\text{hit}} + N_{\text{miss}}} \qquad (43)

where N_hit represents the number of correct detections, N_false denotes the number of false positives, and N_miss denotes the number of misses. Tables 2 and 3 display the classification results using the BGM and BGIDM models, respectively.
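In code, Eqs. 42 and 43 amount to three counts over the binary detection masks; a small sketch (ours):

```python
import numpy as np

def precision_recall(pred, truth):
    """Eqs. 42-43 computed from binary detection masks
    (pred, truth: boolean arrays of the same shape)."""
    n_hit = np.logical_and(pred, truth).sum()
    n_false = np.logical_and(pred, ~truth).sum()
    n_miss = np.logical_and(~pred, truth).sum()
    return n_hit / (n_hit + n_false), n_hit / (n_hit + n_miss)
```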

The results show that our mixture model provides better detection results than the Gaussian model, which can be explained by its flexibility and its ability to represent different forms and shapes.

Table 2 Performance evaluation of the forgery detection for five test sequences using the Bayesian Gaussian mixture (BGM)

Sequence     Precision   Recall
Sequence 1   61.88       90.01
Sequence 2   62.07       90.15
Sequence 3   60.98       89.68
Sequence 4   62.37       90.82
Sequence 5   61.90       91.03

Table 3 Performance evaluation of the forgery detection for five test sequences using the Bayesian GID mixture (BGIDM)

Sequence     Precision   Recall
Sequence 1   65.23       93.45
Sequence 2   66.93       94.20
Sequence 3   64.87       92.96
Sequence 4   66.68       94.22
Sequence 5   65.77       93.08

Table 4 Accuracies (%) for classification of the 13 and 15 categories data sets using different approaches

Method       Data set 1      Data set 2
BGIDM+FS     75.24 ± 0.17    70.89 ± 0.28
BGIDM        73.16 ± 0.25    69.53 ± 0.33
EMGIDM+FS    74.07 ± 0.19    69.78 ± 0.31
EMGIDM       72.94 ± 0.22    68.99 ± 0.31
BGM+FS       70.22 ± 0.31    66.73 ± 0.38
BGM          68.75 ± 0.36    65.64 ± 0.39
EMGM+FS      68.26 ± 0.28    64.88 ± 0.26
EMGM         67.88 ± 0.29    63.71 ± 0.29

4.2 Scene classification

Technological advances have increased the quantity of images generated every day. This is clear from photo-sharing sites on the internet that contain billions of publicly available images (Quack et al. 2004; Crandall et al. 2009). In this context, object detection and recognition (Liu et al. 2004; Shen and Bai 2006; Zhang et al. 2007), visual scene understanding and classification, and content-based image retrieval have received a lot of attention (Duygulu et al. 2002; Lienhart et al. 2003; Hadjidemetriou et al. 2004; Bao et al. 2010; Schiele and Crowley 2000; Gondra and Heisterkamp 2008; Karthikeyan and Aruna 2013). Visual scene indexing and categorization in particular have been the topic of extensive research (Chang et al. 1988; Maree et al. 2005; Pandey and Lazebnik 2011). Typically, global (e.g., color, shape, texture) and/or local features are extracted from each image and organized into a feature vector. The clustering is then performed in the corresponding feature space, where the images are viewed as points. In this section, we apply our model to the important problem of scene classification using local features.


Fig. 2 Sample images from each group in the first data set. a Highway, b inside of cities, c tall buildings, d streets, e forest, f coast, g mountain, h open country, i suburb residence, j bedroom, k kitchen, l living room, m office

Fig. 3 Additional categories in the second data set. a–c Store, d–f industrial

Fig. 4 Examples of frames of different human actions from video sequences in the UCF sports action data set


We have considered two well-known data sets. The first data set (Lazebnik et al. 2006) contains 13 categories of natural scenes: coasts (360 images), forest (328 images), mountain (374 images),

open country (410 images), highway (260 images), inside of cities (308 images), tall buildings (356 images), streets (292 images), suburb residence (241 images), bedroom (174 images), kitchen (151 images), living room (289 images), and office (216 images). Figure 2 shows example images from this data set. The second data set (Lazebnik et al. 2006) contains 15 categories and consists of the 13 categories of the first data set plus 626 other images divided into two


Table 5 Recognition accuracies (%) on the UCF sports data set using different approaches

Method       Accuracy
BGIDM+FS     79.30 ± 1.02
BGIDM        78.84 ± 1.25
EMGIDM+FS    76.77 ± 1.19
EMGIDM       76.04 ± 1.24
BGM+FS       73.33 ± 1.33
BGM          72.95 ± 1.35
EMGM+FS      72.66 ± 1.32
EMGM         72.59 ± 1.36

categories (see Fig. 3): store (315 images) and industrial (311 images). We divided each of these data sets 10 times randomly into two separate halves, half for training and half for testing.

The first step, as mentioned previously, is the extraction of low-level features. Many invariant local features have been proposed in the past (Guo et al. 2009). The feature we use here is SIFT (Lowe 2004), which is based upon the histogram of gradients adjusted by the dominant orientations, computed on detected keypoints of all images and giving a 128-dimensional vector for each keypoint. The extracted vectors were then clustered using the K-means algorithm into 800 visual words. (We have tested several vocabulary sizes, and the best classification results were obtained with 800 visual words.) Each image in the data sets was then represented by an 800-dimensional positive vector describing the frequencies of the set of visual words provided by the constructed visual vocabulary. Having these feature vectors, the probabilistic latent semantic analysis (pLSA) model (Hofmann 2001) is applied, considering 35 topics, for dimensionality reduction, which has been shown to improve classification performance (Quelhas et al. 2007). The classification of the scenes is then reduced to the classification of the resulting 35-dimensional positive vectors using our statistical Bayesian model (BGIDM+FS), which gave an average classification accuracy of 75.24% ± 0.17 in the case of the 13 categories data set and 70.89% ± 0.28 for the second data set. Table 4 presents the classification results using the different approaches. According to this table, it is clear that the best results are obtained using our approach. We can also conclude that feature selection improves the results and that Bayesian inference performs better than its frequentist counterpart.
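The bag-of-visual-words step described above can be sketched as follows (our own illustration; it assumes SIFT descriptors have already been extracted for each image and uses scikit-learn's KMeans, which the paper does not prescribe):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_histograms(descriptors_per_image, n_words=800):
    """Quantize local descriptors (one (n_i, 128) array per image) into a
    visual vocabulary and return one n_words-dimensional count vector per
    image; the pLSA dimensionality reduction step is applied afterwards."""
    vocab = KMeans(n_clusters=n_words, n_init=4, random_state=0)
    vocab.fit(np.vstack(descriptors_per_image))
    hists = []
    for desc in descriptors_per_image:
        words = vocab.predict(desc)
        hists.append(np.bincount(words, minlength=n_words))
    return np.array(hists, dtype=float)
```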

4.3 Action recognition

Automatically understanding events in general (Zelnik-Manor and Irani 2001; Davis and Bobick 1997) and human actions in particular is a challenging problem in computer vision which has been widely studied in the past (Aggarwal and Cai 1999; Rao and Shah 2001; BenAbdelkader et al. 2004; Bobick and Davis 2001; Liu et al. 2017). It has several important applications such as video surveillance and monitoring, tracking, human–computer interfaces, event recognition, and augmented reality (Leibe et al. 2005; Duan et al. 2012; García et al. 2010). Action recognition approaches can be classified into different groups according to the representation used and the classification method considered. Concerning the representation methods, global, local, and depth-based representations have been considered in the past. For classification methods, template-based methods, discriminative models, and generative models have been proposed. For a detailed recent survey the reader is referred to Zhang et al. (2017) and references therein. The goal of this section is to validate our model via its application to the recognition of human activities in video sequences. An important step in this case is the description of these activities. Several space-time interest points and descriptors have been proposed in the past (Chomat and Crowley 1999; Laptev and Lindeberg 2004) to capture and describe local events in videos. Among the various existing space-time interest point detectors and local spatiotemporal features, we adopt the so-called cuboid detector¹ (Dollar et al. 2005), which has shown its effectiveness in modeling human activities. The cuboid detector is based on temporal Gabor filters, and a histogram of the cuboid types is used as our activity descriptor. Our approach to this task is as follows: first, we employ the cuboid detector to extract local spatiotemporal features from each video sequence. After that, the K-means algorithm is applied to the obtained spatiotemporal features to construct a visual vocabulary. We have investigated different sizes of the visual vocabulary, from 100 to 1000; the optimal classification performance was obtained when the size of the visual vocabulary is set to 800. Then, as in the previous application, the pLSA model (Hofmann 2001) is adopted to represent each video sequence by a 50-dimensional vector. Our model is then applied as a classifier to categorize the videos.

We have run our experiment on a very challenging and popular data set, namely the UCF sports data set² (Rodriguez et al. 2008), which was collected by the UCF group from various sports featured on broadcast television channels such as the BBC and ESPN. It consists of over 200 video sequences at a resolution of 720 × 480 with nine actions: "diving," "golf swinging" (g_swinging), "kicking," "lifting,"

1 The source code of the cuboid detector is available at: http://vision.ucsd.edu/~pdollar.
2 The data set is available at: http://vision.eecs.ucf.edu/datasetsActions.html.


"horseback riding" (riding), "running," "skating," "swinging," and "walking." Some example frames from each action class are displayed in Fig. 4. For the UCF sports data set, we used 70% of the video sequences to construct the visual vocabulary. The results presented in the following are obtained over 10 runs. The average recognition accuracies using the different algorithms are shown in Table 5. As we can observe from this table, BGIDM+FS provided the best performance. The results clearly demonstrate again the advantage of using feature selection and the superior performance of the GID mixture as compared to the finite Gaussian mixture.

5 Conclusion

Clustering allows data elements to be grouped according to their similarity. The topic is mature and has a rich history spanning many research areas such as data mining, computer vision, pattern recognition, information retrieval, and machine learning. In this context, the GID mixture has recently come under close scrutiny, from both frequentist and Bayesian perspectives. In this work, we adopt a Bayesian approach and a detailed treatment that bases inference on sampling from the model's posterior distribution. Indeed, this paper considers both the case of known mixture size, using a Gibbs sampling approach with a Metropolis step, and the case of unknown mixture size, using an RJMCMC technique that allows moves from one mixture size to another. We have explored the usefulness of RJMCMC when learning GID distributions to tackle several challenging applications, namely video forgery detection, scene classification, and human action recognition. From the obtained results, we can conclude that GID mixture models are sound from both a theoretical and a practical point of view, since all distributions on the strictly positive reals can be approximated by them. Moreover, skewness is better modeled by such an asymmetric distribution. Some future research issues may concern the development of better learning techniques such as the variational approach. Indeed, in recent years, researchers have recognized that variational learning approaches are well suited for learning complex models due to their relatively low computational requirements as compared to MCMC techniques. Indeed, MCMC algorithms are computationally intensive, and diagnosing their convergence is still a complex problem, especially in high-dimensional settings. Other future works could be devoted to the application of the proposed model to other challenging tasks from different domains.

Acknowledgements The authors would like to thank Umm Al-Qura University, Kingdom of Saudi Arabia, for their funding support under Grant Number 15-COM-3-1-0007.

Compliance with ethical standards

Conflict of interest All authors declare that they have no conflict of interest.

References

Aggarwal JK, Cai Q (1999) Human motion analysis: a review. Comput Vis Image Underst 73(3):428–440

Allili MS, Bouguila N, Ziou D (2007) Finite generalized Gaussian mixture modeling and applications to image and video foreground segmentation. In: Proceedings of the fourth Canadian conference on computer and robot vision (CRV), pp 183–190

Baldi P, Long AD (2001) A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics 17(6):509–519

Bao SYZ, Sun M, Savarese S (2010) Toward coherent object detection and scene layout understanding. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 65–72

Bdiri T, Bouguila N (2012) Positive vectors clustering using inverted Dirichlet finite mixture models. Expert Syst Appl 39(2):1869–1882

BenAbdelkader C, Cutler RG, Davis LS (2004) Gait recognition using image self-similarity. EURASIP J Appl Signal Process 2004:572–585

Bickel PJ, Levina E (2004) Some theory for Fisher's linear discriminant function, 'naive Bayes', and some alternatives when there are many more variables than observations. Bernoulli 10(6):989–1010

Bobick A, Davis J (2001) The recognition of human movement using temporal templates. IEEE Trans Pattern Anal Mach Intell 23(3):257–267

Bong CW, Rajeswari M (2011) Multi-objective nature-inspired clustering and classification techniques for image segmentation. Appl Soft Comput 11(4):3271–3282

Bouguila N (2007) Spatial color image databases summarization. In: Proceedings of the IEEE international conference on acoustics, speech, and signal processing (ICASSP), pp 953–956

Bouguila N (2011) Bayesian hybrid generative discriminative learning based on finite Liouville mixture models. Pattern Recognit 44(6):1183–1200

Bouguila N, Ziou D, Hammoud RI (2009) On Bayesian analysis of a finite generalized Dirichlet mixture via a Metropolis-within-Gibbs sampling. Pattern Anal Appl 12(2):151–166

Bourouis S, Mashrgy MA, Bouguila N (2014) Bayesian learning of finite generalized inverted Dirichlet mixtures: application to object classification and forgery detection. Expert Syst Appl 41(5):2329–2336

Bouveyron C, Brunet C (2012) Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Stat Comput 22(1):301–324

Cabral CRB, Bolfarine H, Pereira JRG (2008) Bayesian density estimation using skew Student-t-normal mixtures. Comput Stat Data Anal 52(12):5075–5090

Cai D, Zhang C, He X (2010) Unsupervised feature selection for multi-cluster data. In: Proceedings of the 16th ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 333–342

Chang S, Yan C, Dimitroff D, Arndt T (1988) An intelligent image database system. IEEE Trans Softw Eng 14(5):681–688

Chen C (2014) Feature selection based on compactness and separability: comparison with filter-based methods. Comput Intell 30(3):636–656

Chib S, Winkelmann R (2001) Markov chain Monte Carlo analysis of correlated count data. J Bus Econ Stat 19(4):428–435


Chomat O, Crowley J (1999) Probabilistic recognition of activity using local appearance. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), vol 2, pp 104–109

Cohen WW, Richman J (2002) Learning to match and cluster large high-dimensional data sets for data integration. In: Proceedings of the eighth ACM SIGKDD international conference on knowledge discovery and data mining (KDD), pp 475–480

Crandall DJ, Backstrom L, Huttenlocher DP, Kleinberg JM (2009) Mapping the world's photos. In: Proceedings of the 18th international conference on world wide web (WWW), ACM, pp 761–770

Das S, Konar A (2009) Automatic image pixel clustering with an improved differential evolution. Appl Soft Comput 9(1):226–236

Davis J, Bobick A (1997) The representation and recognition of human movement using temporal templates. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), pp 928–934

Dias JG, Wedel M (2004) An empirical comparison of EM, SEM and MCMC performance for problematic Gaussian mixture likelihoods. Stat Comput 14(4):323–332

Dollar P, Rabaud V, Cottrell G, Belongie S (2005) Behavior recognition via sparse spatio-temporal features. In: Proceedings of the IEEE international workshop on visual surveillance and performance evaluation of tracking and surveillance (VS-PETS), pp 65–72

Duan L, Xu D, Tsang IWH, Luo J (2012) Visual event recognition in videos by learning from web data. IEEE Trans Pattern Anal Mach Intell 34(9):1667–1680

Duygulu P, Barnard K, de Freitas JFG, Forsyth DA (2002) Object recognition as machine translation: learning a lexicon for a fixed image vocabulary. In: Heyden A, Sparr G, Nielsen M, Johansen P (eds) ECCV (4), Lecture notes in computer science, vol 2353. Springer, pp 97–112

García JM, Benitez LR, Fernández-Caballero A, López MT (2010) Video sequence motion tracking by fuzzification techniques. Appl Soft Comput 10(1):318–331

Geiger D, Heckerman D, King H, Meek C (2001) Stratified exponential families: graphical models and model selection. Ann Stat 29(2):505–529

Gelman A, Rubin DB (1992) Inference from iterative simulation using multiple sequences. Stat Sci 7(4):457–472 (with discussion)

Gokcay E, Príncipe JC (2002) Information theoretic clustering. IEEE Trans Pattern Anal Mach Intell 24(2):158–171

Gondra I, Heisterkamp DR (2008) Content-based image retrieval with the normalized information distance. Comput Vis Image Underst 111(2):219–228

Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Haas LM, Tiwary A (eds) SIGMOD conference. ACM Press, pp 73–84

Guo X, Cao X, Zhang J, Li X (2009) MIFT: a mirror reflection invariant feature descriptor. In: Zha H, Taniguchi R, Maybank SJ (eds) ACCV (2), Lecture notes in computer science, vol 5995. Springer, pp 536–545

Hadjidemetriou E, Grossberg MD, Nayar SK (2004) Multiresolution histograms and their use for recognition. IEEE Trans Pattern Anal Mach Intell 26(7):831–847

Hajji H (2005) Statistical analysis of network traffic for adaptive faults detection. IEEE Trans Neural Netw 16(5):1053–1063

He X, Ji M, Zhang C, Bao H (2011) A variance minimization criterion to feature selection using Laplacian regularization. IEEE Trans Pattern Anal Mach Intell 33(10):2013–2025

Heitz G, Koller D (2008) Learning spatial context: using stuff to find things. In: Forsyth DA, Torr PHS, Zisserman A (eds) ECCV (1), Lecture notes in computer science, vol 5302. Springer, pp 30–43

Hinton G (1999) Products of experts. In: Proceedings of the ninth international conference on artificial neural networks (ICANN), vol 1. IEEE, pp 1–6

Ho RKW, Hu I (2008) Flexible modelling of random effects in linear mixed models—a Bayesian approach. Comput Stat Data Anal 52(3):1347–1361

Hofmann T (2001) Unsupervised learning by probabilistic latent semantic analysis. Mach Learn 42(1/2):177–196

Hsu CC, Hung TY, Lin CW, Hsu CT (2008) Video forgery detection using correlation of noise residue. In: 2008 IEEE 10th workshop on multimedia signal processing, pp 170–174

Jasra A, Stephens DA, Holmes CC (2007) Population-based reversible jump Markov chain Monte Carlo. Biometrika 94(4):787–807

Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R, Wu AY (2002) An efficient k-means clustering algorithm: analysis and implementation. IEEE Trans Pattern Anal Mach Intell 24(7):881–892

Karthikeyan M, Aruna P (2013) Probability based document clustering and image clustering using content-based image retrieval. Appl Soft Comput 13(2):959–966

Kato Z (2008) Segmentation of color images via reversible jump MCMC sampling. Image Vis Comput 26(3):361–371

Kobayashi M, Okabe T, Sato Y (2010) Detecting forgery from static-scene video based on inconsistency in noise level functions. IEEE Trans Inf Forensics Secur 5(4):883–892

Laptev I, Lindeberg T (2004) Velocity adaptation of space-time interest points. In: Proceedings of the 17th international conference on pattern recognition (ICPR), vol 1, pp 52–56

Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), vol 2, pp 2169–2178

Leibe B, Seemann E, Schiele B (2005) Pedestrian detection in crowded scenes. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), vol 1. IEEE Computer Society, pp 878–885

Lienhart R, Kuranov A, Pisarevsky V (2003) Empirical analysis of detection cascades of boosted classifiers for rapid object detection. In: Michaelis B, Krell G (eds) DAGM-symposium, Lecture notes in computer science, vol 2781. Springer, pp 297–304

Lin TI, Lee JC (2007) Bayesian analysis of hierarchical linear mixed modeling using the multivariate t distribution. J Stat Plan Inference 137(2):484–495

Liu JS, Liang F, Wong WH (2000) The multiple-try method and local optimization in Metropolis sampling. J Am Stat Assoc 95(449):121–134

Liu D, Lam K, Shen L (2004) Optimal sampling of Gabor features for face recognition. Pattern Recognit Lett 25(2):267–276

Liu X, He GF, Peng SJ, Cheung YM, Tang YY (2017) Efficient human motion retrieval via temporal adjacent bag of words and discriminative neighborhood preserving dictionary learning. IEEE Trans Hum Mach Syst 47(6):763–776

Law MHC, Figueiredo MAT, Jain AK (2004) Simultaneous feature selection and clustering using mixture models. IEEE Trans Pattern Anal Mach Intell 26(9):1154–1166

Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110

Mamat R, Herawan T, Deris MM (2013) MAR: maximum attribute relative of soft set for clustering attribute selection. Knowl Based Syst 52:11–20

Maree R, Geurts P, Piater J, Wehenkel L (2005) Random subwindows for robust image classification. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), vol 1, pp 34–40

Mashrgy MA, Bdiri T, Bouguila N (2014) Robust simultaneous positive data clustering and unsupervised feature selection using generalized inverted Dirichlet mixture models. Knowl Based Syst 59:182–195


McLachlan G, Khan N (2004) On a resampling approach for tests on the number of clusters with mixture model-based clustering of tissue samples. J Multivar Anal 90(1):90–105

McLachlan G, Peel D, Bean R (2003) Modelling high-dimensional data by mixtures of factor analyzers. Comput Stat Data Anal 41:379–388

Meila M (2007) Comparing clusterings—an information based distance. J Multivar Anal 98(5):873–895

Mishra NS, Ghosh S, Ghosh A (2012) Fuzzy clustering algorithms incorporating local information for change detection in remotely sensed images. Appl Soft Comput 12(8):2683–2692

Mosleh A, Bouguila N, Hamza AB (2012) Video completion using bandlet transform. IEEE Trans Multimed 14(6):1591–1601

Neal RM (2003) Slice sampling. Ann Stat 31(3):705–767

Pandey M, Lazebnik S (2011) Scene recognition and weakly supervised object localization with deformable part-based models. In: Proceedings of the IEEE international conference on computer vision (ICCV), pp 1307–1314

Pizzuti C, Talia D (2003) P-AutoClass: scalable parallel clustering for mining large data sets. IEEE Trans Knowl Data Eng 15(3):629–641

Quack T, Mönich U, Thiele L, Manjunath BS (2004) Cortina: a system for large-scale, content-based web image retrieval. In: Proceedings of the 12th ACM international conference on multimedia (MM). ACM, pp 508–511

Quelhas P, Monay F, Odobez JM, Gatica-Perez D, Tuytelaars T (2007) A thousand words in a scene. IEEE Trans Pattern Anal Mach Intell 29(9):1575–1589

Rao C, Shah M (2001) View-invariance in action recognition. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), vol 2. IEEE Computer Society, pp 316–322

Ren Y, Liu X, Liu W (2012) DBCAMM: a novel density based clustering algorithm via using the Mahalanobis metric. Appl Soft Comput 12(5):1542–1554

Richardson S, Green PJ (1997) On Bayesian analysis of mixtures with an unknown number of components. J R Stat Soc Ser B 59(4):731–792 (with discussion)

Rodriguez M, Ahmed J, Shah M (2008) Action MACH: a spatio-temporal maximum average correlation height filter for action recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR), pp 1–8

Rufo M, Martín J, Pérez C (2006) Bayesian analysis of finite mixture models of distributions from exponential families. Comput Stat 21(3–4):621–637

Ruta A, Porikli F (2012) Compressive clustering of high-dimensional data. In: Proceedings of the 11th international conference on machine learning and applications (ICMLA), pp 380–385

Schiele B, Crowley JL (2000) Recognition without correspondence using multidimensional receptive field histograms. Int J Comput Vis 36(1):31–50

Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464

Shen L, Bai L (2006) MutualBoost learning for selecting Gabor features for face recognition. Pattern Recognit Lett 27(15):1758–1767

Tan M (1993) Cost-sensitive learning of classification knowledge and its applications in robotics. Mach Learn 13(1):7–33

Tu Z, Zhu SC (2002) Image segmentation by data-driven Markov chain Monte Carlo. IEEE Trans Pattern Anal Mach Intell 24(5):657–673

Vlassis N, Likas A (1999) A kurtosis-based dynamic approach to Gaussian mixture modeling. IEEE Trans Syst Man Cybern Part A Syst Hum 29(4):393–399

Vlassis N, Papakonstantinou G, Tsanakas P (1999) Mixture density estimation based on maximum likelihood and sequential test statistics. Neural Process Lett 9(1):63–76

Wang W, Farid H (2007a) Exposing digital forgeries in interlaced and deinterlaced video. IEEE Trans Inf Forensics Secur 2(3):438–449

Wang W, Farid H (2007b) Exposing digital forgeries in video by detecting duplication. In: Proceedings of the 9th workshop on multimedia and security. ACM, New York, NY, USA, pp 35–42

Wang Y, Zhu SC (2004) Analysis and synthesis of textured motion: particles and waves. IEEE Trans Pattern Anal Mach Intell 26(10):1348–1363

Xu D, Tian Y (2015) A comprehensive survey of clustering algorithms. Ann Data Sci 2(2):165–193

Xu D, Xu Z, Liu S, Zhao H (2013) A spectral clustering algorithm based on intuitionistic fuzzy information. Knowl Based Syst 53:20–26

Zahn C (1971) Graph-theoretical methods for detecting and describing Gestalt clusters. IEEE Trans Comput C-20(1):68–86

Zelnik-Manor L, Irani M (2001) Event-based analysis of video. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), vol 2, pp II-123–II-130

Zhang Z, Chan KL, Wu Y, Chen C (2004) Learning a multivariate Gaussian mixture model with the reversible jump MCMC algorithm. Stat Comput 14(4):343–355

Zhang B, Shan S, Chen X, Gao W (2007) Histogram of Gabor phase patterns (HGPP): a novel object representation approach for face recognition. IEEE Trans Image Process 16(1):57–68

Zhang S, Wei Z, Nie J, Huang L, Wang S, Li Z (2017) A review on human activity recognition using vision-based method. J Healthc Eng (Article ID 3090343)

Zhao P, Zhang CQ (2011) A new clustering method and its application in social networks. Pattern Recognit Lett 32(15):2109–2118

Publisher's Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
