
Least-Squares Log-Density Gradient Clustering for Riemannian Manifolds

Mina Ashizawa¹  Hiroaki Sasaki²,³  Tomoya Sakai¹  Masashi Sugiyama³,¹

¹The University of Tokyo, Japan  ²Nara Institute of Science and Technology, Japan  ³RIKEN, Japan

Abstract

Mean shift is a mode-seeking clustering algorithm that has been successfully used in a wide range of applications such as image segmentation and object tracking. To further improve the clustering performance, mean shift has been extended in various directions, including generalization to handle data on Riemannian manifolds and extension to directly estimating the log-density gradient without density estimation. In this paper, we combine these ideas and propose a novel mode-seeking algorithm for Riemannian manifolds with direct log-density gradient estimation. Although the idea of combining the two extensions is rather straightforward, directly estimating the log-density gradient on Riemannian manifolds is mathematically challenging. We will provide a mathematically sound algorithm and demonstrate its usefulness through experiments.

1 Introduction

Clustering is one of the most important unsupervised learning tasks in machine learning and has been extensively studied for decades (Clarke et al., 2009; Murphy, 2012). Among various different types of clustering algorithms, mode-seeking is a well-studied and practically useful approach. Mean shift (Fukunaga and Hostetler, 1975; Cheng, 1995; Comaniciu and Meer, 2002; Carreira-Perpinan, 2015) is a seminal algorithm for mode-seeking clustering: Kernel density estimation is first performed on given data points, and then the data points are updated along the gradient of the estimated density towards the modes. Finally, the data points which converged to the same mode are given the same cluster label. A notable advantage of mean shift is that it does not require the number of clusters to be specified in advance. Thanks to this useful property, mean shift has been successfully employed in a wide range of real-world applications such as image segmentation (Wang et al., 2004; Tao et al., 2007) and object tracking (Comaniciu et al., 2000; Collins, 2003).

Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP volume 54. Copyright 2017 by the author(s).

The original mean shift algorithm considers data points in the Euclidean space. However, in practice, data points sometimes lie on a structured space such as the Lie group and Grassmann manifold. For data on such a structured space, kernel density estimation and gradient ascent with the Euclidean metric do not necessarily perform appropriately. To cope with this problem, mean shift has been extended to Riemannian manifolds (Boothby, 2003) and demonstrated to work well in experiments (Tuzel et al., 2005; Subbarao and Meer, 2006, 2009; Cetingul and Vidal, 2009).

Another important extension of mean shift is to avoid density estimation. The original mean shift algorithm uses kernel density estimation, which tends to perform poorly when the data dimension is high. Furthermore, a good density estimator does not necessarily mean a good density gradient estimator, and thus the two-step approach of first estimating the density and then computing its gradient does not always perform well. To cope with this problem, a method to directly estimate the log-density gradient without density estimation has been developed (Cox, 1985; Sasaki et al., 2014), and a mode-seeking clustering algorithm based on the direct log-density gradient estimator was experimentally shown to work well (Sasaki et al., 2014).

The purpose of this paper is to combine these two extensions and propose a novel clustering algorithm based on direct log-density gradient estimation and mode-seeking on Riemannian manifolds. Although the idea of combining the two extensions is rather straightforward, directly estimating the density gradient on Riemannian manifolds is mathematically challenging. We will provide a mathematically sound algorithm and demonstrate its usefulness through experiments.


2 Problem Formulation

In this section, we formulate the clustering problem by mode-seeking and review existing algorithms.

Clustering by Mode-Seeking: Suppose that we are given independent and identically distributed samples of size n on $\mathbb{R}^d$, $\mathcal{X} = \{x_i\}_{i=1}^{n}$, with unknown probability density $p(x)$. The goal of clustering is to split the set $\mathcal{X}$ into c disjoint subsets $\{\mathcal{X}_i\}_{i=1}^{c}$ so that samples in each subset share similar properties while samples in different subsets have different properties.

Various types of clustering algorithms have been explored so far (Clarke et al., 2009; Murphy, 2012). Among them, mode-seeking is one of the popular and well-studied approaches (Fukunaga and Hostetler, 1975; Cheng, 1995; Comaniciu and Meer, 2002; Carreira-Perpinan, 2015). In mode-seeking clustering, data samples $\{x_i\}_{i=1}^{n}$ are first gathered to the modes of the data density $p(x)$ by, e.g., gradient ascent $x \leftarrow x + \varepsilon \nabla p(x)$, where $\varepsilon > 0$ is the step size and $\nabla p(x)$ is the gradient of the density function $p(x)$ with respect to $x = (x^{(1)}, \ldots, x^{(d)})^\top$. Then, data samples which converged to the same mode are given the same cluster label.

Below, we review representative mode-seeking clustering algorithms.

Mean Shift Clustering: Mean shift is a seminal algorithm of mode-seeking clustering (Fukunaga and Hostetler, 1975; Cheng, 1995; Comaniciu and Meer, 2002). In the mean shift algorithm, the probability density $p(x)$ is first learned by kernel density estimation:

$$\hat{p}(x) = \frac{c_{k,\sigma}}{n} \sum_{i=1}^{n} k\!\left(\left\|\frac{x - x_i}{\sigma}\right\|^2\right), \qquad (1)$$

where $k(t)$ is a non-negative function, σ > 0 is the bandwidth, and $c_{k,\sigma}$ is the normalization constant such that the integration of $\hat{p}(x)$ is equal to 1. As the function $k(t)$, the exponentially decaying function $k(t) = \exp(-t/2)$ is often used in practice, which yields the Gaussian kernel:

$$k\!\left(\left\|\frac{x - x_i}{\sigma}\right\|^2\right) = \exp\!\left(-\frac{\|x - x_i\|^2}{2\sigma^2}\right).$$

The bandwidth σ can be systematically chosen by cross-validation with respect to the log-likelihood or squared error criteria.

Next, the gradient of the kernel density estimator $\hat{p}(x)$ is computed:

$$\nabla \hat{p}(x) = \frac{c_{k',\sigma}}{n} \sum_{i=1}^{n} (x_i - x)\, k'\!\left(\left\|\frac{x - x_i}{\sigma}\right\|^2\right) = \varepsilon(x)\, m(x),$$

where $c_{k',\sigma} = -2c_{k,\sigma}/\sigma^2$, $k'$ is the derivative of $k$,

$$\varepsilon(x) := \frac{c_{k',\sigma}}{n} \sum_{i=1}^{n} k'\!\left(\left\|\frac{x - x_i}{\sigma}\right\|^2\right) > 0,$$

and $m(x)$ is called the mean shift vector (Comaniciu and Meer, 2002):

$$m(x) := \frac{\sum_{i=1}^{n} x_i\, k'\!\left(\left\|\frac{x - x_i}{\sigma}\right\|^2\right)}{\sum_{i'=1}^{n} k'\!\left(\left\|\frac{x - x_{i'}}{\sigma}\right\|^2\right)} - x.$$

To obtain the modes of the data density, the mean shift algorithm uses a fixed-point iteration for mode-seeking. More specifically, a necessary condition for a local maximum, $\nabla \hat{p}(x) = 0$, implies $m(x) = 0$, which yields the update $x \leftarrow x + m(x)$. Since this fixed-point update rule can be rewritten as $x \leftarrow x + \frac{1}{\varepsilon(x)}\nabla \hat{p}(x)$, it can be interpreted as gradient ascent with adaptive step size $1/\varepsilon(x) > 0$.

After repeating the fixed-point update for all samples $\{x_i\}_{i=1}^{n}$ until convergence, samples which converge to the same mode are given the same cluster label.
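As a concrete illustration, the following is a minimal NumPy sketch of Gaussian-kernel mean shift clustering along the lines described above; the bandwidth, convergence tolerance, and mode-merging threshold are illustrative choices, not values prescribed by the cited work.

```python
import numpy as np

def mean_shift(X, sigma=1.0, n_iter=100, tol=1e-6, merge_tol=1e-2):
    """Gaussian-kernel mean shift clustering on Euclidean data X (n x d)."""
    modes = X.copy()
    for i in range(len(X)):
        x = X[i]
        for _ in range(n_iter):
            # For the Gaussian kernel k(t) = exp(-t/2), |k'| is proportional to k,
            # so the weighted mean below equals x + m(x).
            w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * sigma ** 2))
            x_new = w @ X / w.sum()
            if np.linalg.norm(x_new - x) < tol:
                break
            x = x_new
        modes[i] = x
    # Samples whose modes (nearly) coincide receive the same cluster label.
    labels = -np.ones(len(X), dtype=int)
    cluster_modes = []
    for i, m in enumerate(modes):
        for c, cm in enumerate(cluster_modes):
            if np.linalg.norm(m - cm) < merge_tol:
                labels[i] = c
                break
        else:
            cluster_modes.append(m)
            labels[i] = len(cluster_modes) - 1
    return labels, modes
```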

The mean shift algorithm has been successfully employed in various real-world applications such as image segmentation (Wang et al., 2004; Tao et al., 2007) and object tracking (Comaniciu et al., 2000; Collins, 2003).

Mean Shift Clustering on Riemannian Manifolds: The original mean shift algorithm considers data points in the Euclidean space. However, in practice, data points sometimes lie on a structured space such as the Lie group and Grassmann manifold. For data on such a structured space, kernel density estimation and gradient ascent with the Euclidean metric do not necessarily perform appropriately. To cope with this problem, the mean shift algorithm has been extended to Riemannian manifolds (Tuzel et al., 2005; Subbarao and Meer, 2006, 2009; Cetingul and Vidal, 2009). Here we briefly review such an extension. Let us consider data points $\{X_i\}_{i=1}^{n}$ on a Riemannian manifold $\mathcal{M}^m$ of dimension $m \leq d$ embedded in $\mathbb{R}^d$.

As shown in Eq.(1), kernel density estimation in the original mean shift algorithm uses the squared Euclidean distance between $x$ and $x_i$, i.e., $\|x - x_i\|^2$. For data points on a Riemannian manifold, this may be replaced with the squared geodesic distance between $X$ and $X_i$:

$$\hat{p}(X) = \frac{c_{k,\sigma,\delta}}{n} \sum_{i=1}^{n} k\!\left(\frac{\delta(X, X_i)^2}{\sigma^2}\right), \qquad (2)$$

where $\delta(X, X')$ denotes the geodesic distance between $X$ and $X'$, and $c_{k,\sigma,\delta}$ is a positive constant.¹

¹Strictly speaking, Eq.(2) is inappropriate as a density estimator on Riemannian manifolds: to ensure that the integration of $\hat{p}(X)$ is equal to 1, the normalizing constant of each $k\!\left(\frac{\delta(X, X_i)^2}{\sigma^2}\right)$ has to depend on $X_i$ (Pelletier, 2005; Arvanitidis et al., 2016). Nonetheless, Eq.(2) has been employed in the mean shift algorithm because of its computational efficiency and simple form (Subbarao and Meer, 2009).


Figure 1: The exponential map and the logarithm map. The exponential map at the point X transforms points on the tangent space at X to the manifold, while the logarithm map at X transforms points on the manifold to the tangent space at X.

Then, the gradient of the estimated density is given by

$$\nabla \hat{p}(X) = -\frac{c_{k',\sigma,\delta}}{2n} \sum_{i=1}^{n} \left[\nabla \delta(X, X_i)^2\right] k'\!\left(\frac{\delta(X, X_i)^2}{\sigma^2}\right) = \frac{c_{k',\sigma,\delta}}{n} \sum_{i=1}^{n} \left[\log_X X_i\right] k'\!\left(\frac{\delta(X, X_i)^2}{\sigma^2}\right), \qquad (3)$$

where $c_{k',\sigma,\delta} = -2c_{k,\sigma,\delta}/\sigma^2$. Here, $\log_X X_i$ denotes the logarithm map of $X_i$ at $X$, which satisfies the following relation (Boothby, 2003):

$$\log_X X_i = -\frac{1}{2} \nabla \delta(X, X_i)^2. \qquad (4)$$

In the same way as the original mean shift algorithm, Eq.(3) can be expressed as $\nabla \hat{p}(X) = \varepsilon(X) M(X)$, where
$$\varepsilon(X) := \frac{c_{k',\sigma,\delta}}{n} \sum_{i=1}^{n} k'\!\left(\frac{\delta(X, X_i)^2}{\sigma^2}\right) > 0$$
and $M(X)$ is the mean shift vector defined as
$$M(X) := \frac{\sum_{i=1}^{n} \left[\log_X X_i\right] k'\!\left(\frac{\delta(X, X_i)^2}{\sigma^2}\right)}{\sum_{i'=1}^{n} k'\!\left(\frac{\delta(X, X_{i'})^2}{\sigma^2}\right)}.$$

This shows that the mean shift vector $M(X)$ lies on the tangent space at $X$, which is denoted by $T_X\mathcal{M}^m$ (see Figure 1).

Thus, when a point $X$ on the manifold is updated along this vector, it no longer lies on the manifold. To cope with this problem, the updated point is projected back to the manifold by the exponential map $X \leftarrow \exp_X M(X)$ (see Figure 1 again) (Boothby, 2003).

After repeating this update for all samples $\{X_i\}_{i=1}^{n}$ until convergence, samples which converge to the same mode are given the same cluster label.
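The whole Riemannian update can be summarized in a few lines once the exponential and logarithm maps are available. Below is a hedged sketch of one fixed-point step; geodesic_dist, log_map, and exp_map are placeholder functions to be supplied for the manifold at hand (e.g., the Grassmann maps given in Section 3), and the Gaussian-type kernel and bandwidth are illustrative assumptions.

```python
import numpy as np

def riemannian_mean_shift_step(X, samples, geodesic_dist, log_map, exp_map, sigma=1.0):
    """One fixed-point update X <- exp_X(M(X)) of Riemannian mean shift.

    samples: list of manifold points X_i; geodesic_dist, log_map, exp_map:
    manifold-specific operations supplied by the caller."""
    # Weights |k'(delta^2 / sigma^2)| for the Gaussian-type kernel k(t) = exp(-t/2).
    w = np.array([np.exp(-geodesic_dist(X, Xi) ** 2 / (2 * sigma ** 2)) for Xi in samples])
    # Mean shift vector M(X): weighted average of log_X(X_i) on the tangent space T_X M.
    M = sum(wi * log_map(X, Xi) for wi, Xi in zip(w, samples)) / w.sum()
    return exp_map(X, M)  # project the update back onto the manifold
```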


Figure 2: Density estimation and log-density gradient estimation; panel (a) shows the data densities and panel (b) shows the log-density gradients. $\hat{p}_2(x)$ is a better estimate of the true density $p(x)$ than $\hat{p}_1(x)$, but $\nabla \log \hat{p}_2(x)$ is a worse estimate of the true log-density gradient $\nabla \log p(x)$ than $\nabla \log \hat{p}_1(x)$. Thus, a good density estimator is not necessarily a good log-density gradient estimator.

The mean shift algorithm on Riemannian manifolds has been shown to work well experimentally (Tuzel et al., 2005; Subbarao and Meer, 2006, 2009; Cetingul and Vidal, 2009).

Least-Squares Log-Density Gradient Clustering: Another important extension of the original mean shift algorithm is to avoid density estimation (Sasaki et al., 2014). Kernel density estimation used in the original mean shift algorithm tends to perform poorly when the data dimension d is high. Furthermore, a good density estimator does not necessarily mean a good density gradient estimator (see Figure 2), and thus the two-step approach of first estimating the density and then computing its gradient does not always perform well. To cope with these problems, a method to directly estimate the density gradient without density estimation has been developed (Cox, 1985; Sasaki et al., 2014). Here, we briefly review the direct log-density gradient estimator called the least-squares log-density gradient (LSLDG) and the mode-seeking clustering algorithm based on it called LSLDG clustering (Sasaki et al., 2014).

Let $g(x) := (g^{(1)}(x), \ldots, g^{(d)}(x))^\top$ be the gradient of the log-density function, where $g^{(j)}(x) := \partial_j \log p(x)$ and $\partial_j = \frac{\partial}{\partial x^{(j)}}$. The key idea of LSLDG is to directly fit a model $\hat{g}^{(j)}(x)$ to the true log-density gradient $g^{(j)}(x)$ under the squared loss:

$$J^{(j)}(\hat{g}^{(j)}(x)) := \int \left(\hat{g}^{(j)}(x) - g^{(j)}(x)\right)^2 p(x)\, dx - C = \int \hat{g}^{(j)}(x)^2 p(x)\, dx - 2\int \hat{g}^{(j)}(x)\, \partial_j p(x)\, dx = \int \hat{g}^{(j)}(x)^2 p(x)\, dx + 2\int \partial_j \hat{g}^{(j)}(x)\, p(x)\, dx,$$
where $C = \int g^{(j)}(x)^2 p(x)\, dx$ and the last equality follows from integration by parts under


$\lim_{x^{(j)} \to \pm\infty} \hat{g}^{(j)}(x) p(x) = 0$. Then the empirical approximation of $J^{(j)}$ is given as
$$J^{(j)}(\hat{g}^{(j)}(x)) \approx \frac{1}{n}\sum_{i=1}^{n} \hat{g}^{(j)}(x_i)^2 + \frac{2}{n}\sum_{i=1}^{n} \partial_j \hat{g}^{(j)}(x_i). \qquad (5)$$

As the log-density gradient model $\hat{g}^{(j)}(x)$, a linear-in-parameter model is used:

$$\hat{g}^{(j)}(x) = \theta^{(j)\top}\psi^{(j)}(x) = \sum_{l=1}^{b} \theta^{(j)}_l \psi^{(j)}_l(x),$$

where $b$ denotes the number of parameters, $\theta^{(j)} \in \mathbb{R}^b$ is the parameter vector, and $\psi^{(j)}(x) \in \mathbb{R}^b$ is the vector of basis functions. By plugging this linear model into Eq.(5) and adding the $\ell_2$-regularizer to avoid overfitting, the following optimization problem is obtained:

$$\hat{\theta}^{(j)} = \operatorname*{argmin}_{\theta^{(j)} \in \mathbb{R}^b}\left[\theta^{(j)\top} G^{(j)} \theta^{(j)} + 2\theta^{(j)\top} h^{(j)} + \lambda^{(j)} \theta^{(j)\top}\theta^{(j)}\right],$$

where $\lambda^{(j)} \geq 0$ is the regularization parameter, and $G^{(j)}$ and $h^{(j)}$ are defined as

$$G^{(j)} := \frac{1}{n}\sum_{i=1}^{n} \psi^{(j)}(x_i)\,\psi^{(j)}(x_i)^\top, \qquad h^{(j)} := \frac{1}{n}\sum_{i=1}^{n} \partial_j \psi^{(j)}(x_i).$$

The optimal solution $\hat{\theta}^{(j)}$ is obtained analytically by

$$\hat{\theta}^{(j)} = -(G^{(j)} + \lambda^{(j)} I_b)^{-1} h^{(j)},$$

where $I_b$ denotes the $b \times b$ identity matrix. All hyper-parameters such as the regularization parameter $\lambda^{(j)}$ and the basis parameters in $\psi^{(j)}(x)$ can be systematically chosen by cross-validation with respect to the squared error criterion $J^{(j)}$. This direct log-density gradient estimator is called LSLDG.
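To make the estimator concrete, the following is a small sketch of the LSLDG solution for a single coordinate j, using the Gaussian-derivative basis introduced below (Sasaki et al., 2014); the bandwidth, regularization value, and number of centers are assumed to be given (in practice chosen by cross-validation), and the function name is illustrative.

```python
import numpy as np

def lsldg_fit_dim(X, j, centers, sigma, lam):
    """Fit theta^(j) for the j-th log-density partial derivative by LSLDG.

    X: (n, d) data, centers: (b, d) Gaussian centers, sigma: bandwidth,
    lam: l2-regularization parameter."""
    n = X.shape[0]
    b = len(centers)
    # Gaussian kernels phi_l(x_i) and their partial derivatives psi_l^(j)(x_i).
    sq = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)   # (n, b)
    phi = np.exp(-sq / (2 * sigma ** 2))
    psi = (centers[None, :, j] - X[:, None, j]) / sigma ** 2 * phi  # (n, b)
    # d/dx^(j) psi_l^(j)(x) = (-1/sigma^2 + (c_l^(j) - x^(j))^2 / sigma^4) phi_l(x)
    dpsi = (-1.0 / sigma ** 2 + (centers[None, :, j] - X[:, None, j]) ** 2 / sigma ** 4) * phi
    G = psi.T @ psi / n                                  # (b, b) matrix G^(j)
    h = dpsi.mean(axis=0)                                # (b,) vector h^(j)
    theta = -np.linalg.solve(G + lam * np.eye(b), h)     # theta^(j) = -(G + lam I)^{-1} h
    return theta
```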

To derive a mean-shift-like fixed-point algorithm from LSLDG, for the Gaussian kernel $\phi^{(j)}_l(x) := \exp\!\left(-\frac{\|x - c_l\|^2}{2\sigma^{(j)2}}\right)$, where the centers $\{c_l\}_{l=1}^{b}$ are chosen randomly from $\{x_i\}_{i=1}^{n}$ without overlap, its partial derivative was proposed to be used as the basis functions (Sasaki et al., 2014):

$$\psi^{(j)}_l(x) := \partial_j \phi^{(j)}_l(x) = \frac{1}{\sigma^{(j)2}}\left(c^{(j)}_l - x^{(j)}\right)\exp\!\left(-\frac{\|x - c_l\|^2}{2\sigma^{(j)2}}\right).$$

Suppose $\sum_{l=1}^{b} \hat{\theta}^{(j)}_l \phi^{(j)}_l(x) \neq 0$. Then the LSLDG solution can be expressed as

$$\hat{g}^{(j)}(x) = \sum_{l=1}^{b} \hat{\theta}^{(j)}_l \psi^{(j)}_l(x) = \frac{1}{\sigma^{(j)2}}\sum_{l=1}^{b} \hat{\theta}^{(j)}_l \left(c^{(j)}_l - x^{(j)}\right)\phi^{(j)}_l(x) = \varepsilon^{(j)}(x)\, m^{(j)}(x),$$

where $\varepsilon^{(j)}(x) = \frac{1}{\sigma^{(j)2}}\sum_{l=1}^{b} \hat{\theta}^{(j)}_l \phi^{(j)}_l(x)$ and $m^{(j)}(x)$ is the $j$-th element of the mean shift vector defined as

$$m^{(j)}(x) := \frac{\sum_{l=1}^{b} \hat{\theta}^{(j)}_l c^{(j)}_l \phi^{(j)}_l(x)}{\sum_{l'=1}^{b} \hat{\theta}^{(j)}_{l'} \phi^{(j)}_{l'}(x)} - x^{(j)}.$$

Then, a necessary condition for a local maximum, $\partial_j p(x) = 0$, implies $m^{(j)}(x) = 0$, which yields the update $x^{(j)} \leftarrow x^{(j)} + m^{(j)}(x)$. This update formula can be regarded as a weighted variant of the original mean shift algorithm (Cheng, 1995), and it is reduced to the original mean shift algorithm when $b = n$ and $\hat{\theta}^{(j)}_l = 1/n$.

The LSLDG clustering algorithm was demonstrated to work well in experiments (Sasaki et al., 2014).
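For completeness, here is a sketch of the resulting per-dimension fixed-point update given fitted parameters; thetas, centers, and sigmas are assumed to come from an LSLDG fit such as the one sketched above, and the condition $\sum_l \hat{\theta}^{(j)}_l \phi^{(j)}_l(x) \neq 0$ is taken for granted.

```python
import numpy as np

def lsldg_update(x, thetas, centers, sigmas):
    """One LSLDG mode-seeking step x^(j) <- x^(j) + m^(j)(x) for every dimension j.

    thetas: (d, b) fitted parameter vectors, centers: (b, d), sigmas: (d,) bandwidths."""
    x_new = x.copy()
    for j in range(len(x)):
        phi = np.exp(-((x - centers) ** 2).sum(axis=1) / (2 * sigmas[j] ** 2))  # phi_l^(j)(x)
        w = thetas[j] * phi
        # m^(j)(x) = sum_l theta_l c_l^(j) phi_l / sum_l theta_l phi_l - x^(j)
        x_new[j] = (w @ centers[:, j]) / w.sum()
    return x_new
```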

3 Proposed Method

In this section, we propose to combine the Riemannian extension of the mean shift algorithm and the LSLDG algorithm, and develop a novel clustering algorithm for Riemannian manifolds. Below, we consider data points $\{X_i\}_{i=1}^{n}$ on a Riemannian manifold $(\mathcal{M}^m, H)$ of dimension $m (\leq d)$ embedded in $\mathbb{R}^d$ with Riemannian metric $H$.

Direct Log-Density-Gradient Estimation on Riemannian Manifolds: If LSLDG is naively applied to data on Riemannian manifolds, the estimated gradient vector does not necessarily lie on the tangent space. To prevent this problem, we propose to use a common parameter vector for all dimensions with basis functions confined in the tangent space. More specifically, let the true log-density gradient vector be $g(X) := \nabla \log p(X) \in T_X\mathcal{M}^m$, where $\nabla$ denotes the Riemannian gradient. We model $g(X)$ by $\hat{g}(X) = \sum_{l=1}^{b} \theta_l \psi_l(X) \in T_X\mathcal{M}^m$, where $\theta_l$ is the common parameter, $b$ is the number of parameters, and $\psi_l(X)$ is the vector of basis functions which we assume to be on the tangent space $T_X\mathcal{M}^m$.

This common-parameter model is fitted to the true log-density gradient $g(X)$ under the squared loss on a manifold:

$$J(\hat{g}(X)) := \int_{\mathcal{M}^m} \|\hat{g}(X) - g(X)\|_H^2\, p(X)\, d\mathrm{vol}_X - C = \int_{\mathcal{M}^m} \|\hat{g}(X)\|_H^2\, p(X)\, d\mathrm{vol}_X - 2\int_{\mathcal{M}^m} \langle \hat{g}(X), g(X)\rangle_H\, p(X)\, d\mathrm{vol}_X = \int_{\mathcal{M}^m} \|\hat{g}(X)\|_H^2\, p(X)\, d\mathrm{vol}_X - 2\int_{\mathcal{M}^m} \langle \hat{g}(X), \nabla p(X)\rangle_H\, d\mathrm{vol}_X, \qquad (6)$$


where $C = \int_{\mathcal{M}^m} \|g(X)\|_H^2\, p(X)\, d\mathrm{vol}_X$, $\|\cdot\|_H^2 = \langle\cdot,\cdot\rangle_H$ denotes the inner product operator, $d\mathrm{vol}_X$ denotes a volume element of a Riemannian manifold (Petersen, 2006), and the last equality follows from the relation $g(X) = \nabla \log p(X) = \nabla p(X)/p(X)$. Applying the “integration by parts” formula for manifolds without boundary (Lee, 2012) to the second term in Eq.(6), we obtain
$$\int_{\mathcal{M}^m} \langle \hat{g}(X), \nabla p(X)\rangle_H\, d\mathrm{vol}_X = -\int_{\mathcal{M}^m} p(X)\, \mathrm{div}\, \hat{g}(X)\, d\mathrm{vol}_X,$$

where “div” denotes the divergence (Petersen, 2006). Approximating the expectation over $p(X)$ by the average of the samples $\{X_i\}_{i=1}^{n}$ and adding the $\ell_2$-regularization term, the following optimization problem is obtained:
$$\hat{\theta} = \operatorname*{argmin}_{\theta \in \mathbb{R}^b}\left[\theta^\top G\theta + 2\theta^\top h + \lambda\theta^\top\theta\right],$$

where, for $l, l' = 1, \ldots, b$,
$$G_{l,l'} = \frac{1}{n}\sum_{i=1}^{n}\langle \psi_l(X_i), \psi_{l'}(X_i)\rangle_H, \qquad (7)$$
$$h_l = \frac{1}{n}\sum_{i=1}^{n} \mathrm{div}(\psi_l(X_i)), \qquad (8)$$

and $\lambda \geq 0$ is the regularization parameter. The optimal solution $\hat{\theta}$ can be obtained analytically as
$$\hat{\theta} = -(G + \lambda I_b)^{-1} h.$$

All hyper-parameters such as the regularization parameter $\lambda$ and the basis parameters in $\psi_l(X)$ can be systematically chosen by cross-validation with respect to the squared error criterion $J$. Finally, the log-density gradient estimator is given by $\hat{g}(X) = \sum_{l=1}^{b} \hat{\theta}_l \psi_l(X)$. We call this method Riemannian LSLDG (R-LSLDG).

Mode-Seeking on Riemannian Manifolds: Based on the estimated log-density gradient $\hat{g}(X)$, we propose a mode-seeking algorithm on Riemannian manifolds.

Let $\phi_l(X) = \exp\!\left(-\frac{\delta(X, C_l)^2}{2\sigma^2}\right)$, where the Gaussian centers $\{C_l\}_{l=1}^{b}$ are randomly chosen from the data samples $\{X_i\}_{i=1}^{n}$ without overlap. In the same way as the original LSLDG clustering, we use its gradient as the basis function vector $\psi_l(X)$:

$$\psi_l(X) = -\frac{\left[\nabla\delta(X, C_l)^2\right]\phi_l(X)}{2\sigma^2} = \frac{\left[\log_X C_l\right]\phi_l(X)}{\sigma^2},$$

where we used Eq.(4).

Under the assumption that $\sum_{l=1}^{b} \hat{\theta}_l \phi_l(X) \neq 0$, analogously to the original mean shift algorithm, the mean shift vector is given as
$$\hat{M}(X) = \frac{\sum_{l=1}^{b} \hat{\theta}_l \left[\log_X C_l\right]\phi_l(X)}{\sum_{l'=1}^{b} \hat{\theta}_{l'} \phi_{l'}(X)},$$
which always belongs to the tangent space $T_X\mathcal{M}^m$. Then the updated point is projected back to the manifold by the exponential map $X \leftarrow \exp_X \hat{M}(X)$. After repeating this update for all samples $\{X_i\}_{i=1}^{n}$ until convergence, samples which converge to the same mode are given the same cluster label. We call this clustering method R-LSLDG clustering (R-LSLDGC).
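Given the fitted common parameter vector, a single R-LSLDGC update can be sketched as follows; log_map, exp_map, and geodesic_dist are manifold-specific placeholders (their Grassmann versions appear later in this section), and the single shared bandwidth follows the basic formulation above.

```python
import numpy as np

def r_lsldgc_step(X, theta, centers, sigma, geodesic_dist, log_map, exp_map):
    """One R-LSLDGC update X <- exp_X(M(X)) with the learned common parameters theta."""
    phi = np.array([np.exp(-geodesic_dist(X, C) ** 2 / (2 * sigma ** 2)) for C in centers])
    w = theta * phi                          # theta_l * phi_l(X)
    # M(X) = sum_l theta_l [log_X C_l] phi_l(X) / sum_l theta_l phi_l(X)
    M = sum(wl * log_map(X, C) for wl, C in zip(w, centers)) / w.sum()
    return exp_map(X, M)
```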

Note that the common parameter formulation of LSLDG can be regarded as the limit of the multi-task LSLDG method (Yamane et al., 2016).

For further improvement, we can use a different bandwidth for each Gaussian center $C_l$ similarly to Comaniciu et al. (2001). Specifically, we use the basis function
$$\psi_l(X) = \frac{1}{\sigma_l^2}\left[\log_X C_l\right]\phi_l(X),$$
where $\phi_l(X) = \exp\!\left(-\delta(X, C_l)^2/(2\sigma_l^2)\right)$. The mean shift vector is given by
$$\hat{M}(X) = \frac{\sum_{l=1}^{b} \frac{\hat{\theta}_l}{\sigma_l^2}\left[\log_X C_l\right]\phi_l(X)}{\sum_{l'=1}^{b} \frac{\hat{\theta}_{l'}}{\sigma_{l'}^2}\phi_{l'}(X)},$$
where $\hat{\theta}_l$ is learned with the basis function $\psi_l(X)$.

Grassmann Manifold: In the experiments in the next section, we use the Grassmann manifold (Edelman et al., 1998) as an example of Riemannian manifolds. The Grassmann manifold $\mathcal{G}_{d_1,d_2}$ is the set of $d_2$-dimensional linear subspaces in $\mathbb{R}^{d_1}$ for $d_2 \leq d_1$:
$$\mathcal{G}_{d_1,d_2} := \{\mathrm{span}(X) \mid X^\top X = I_{d_2},\ X \in \mathbb{R}^{d_1 \times d_2}\},$$
where $\mathrm{span}(X)$ denotes the subspace spanned by the columns of $X$. Denoting by $T_X\mathcal{G}_{d_1,d_2}$ the tangent space on the Grassmann manifold $\mathcal{G}_{d_1,d_2}$ at location $X \in \mathcal{G}_{d_1,d_2}$, the canonical metric $\langle\cdot,\cdot\rangle_H$ for the Grassmann manifold is equivalent to the Euclidean metric $\langle\cdot,\cdot\rangle$ (Edelman et al., 1998). Thus, $\langle W, Z\rangle_H = \mathrm{tr}(W^\top Z)$ holds for any $W, Z \in T_X\mathcal{G}_{d_1,d_2}$, where $\mathrm{tr}(A) = \sum_{i=1}^{d} A_{i,i}$ for a square matrix $A \in \mathbb{R}^{d \times d}$.

The exponential map for $Z \in T_X\mathcal{G}_{d_1,d_2}$ is given by
$$\exp_X Z = (X V \cos\Sigma + U \sin\Sigma)V^\top,$$
where $U$, $\Sigma$, and $V$ come from the compact singular value decomposition of $Z$, i.e., $Z = U\Sigma V^\top$, and “cos” and “sin” for the matrix $\Sigma$ act element-by-element along the diagonal of $\Sigma$. The logarithm map for $Y \in \mathcal{G}_{d_1,d_2}$ is given by
$$\log_X Y = (I_{d_1} - XX^\top)YY^\top X.$$


The squared geodesic distance between two points on the Grassmann manifold is given by
$$\delta(X, Y)^2 = d_2 - \mathrm{tr}(Y^\top X X^\top Y).$$
Thus, we have
$$\nabla\delta(X, Y)^2 = -2(I_{d_1} - XX^\top)YY^\top X,$$
where we used the transformation from the partial derivative $\frac{\partial}{\partial X}f(X)$ of a function $f(X)$ to the manifold gradient $\nabla f(X)$: $\nabla f(X) = (I_{d_1} - XX^\top)\frac{\partial}{\partial X}f(X)$.
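These Grassmann operations translate directly into code. The following NumPy sketch implements the exponential map, the logarithm map, and the squared geodesic distance exactly as written above; it is meant as an illustration rather than an optimized implementation.

```python
import numpy as np

def grassmann_exp(X, Z):
    """Exponential map exp_X(Z) = (X V cos(S) + U sin(S)) V^T on G_{d1,d2}."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)   # compact SVD Z = U diag(s) V^T
    return (X @ Vt.T * np.cos(s) + U * np.sin(s)) @ Vt

def grassmann_log(X, Y):
    """Logarithm map log_X(Y) = (I - X X^T) Y Y^T X (the formula used above)."""
    d1 = X.shape[0]
    return (np.eye(d1) - X @ X.T) @ Y @ Y.T @ X

def grassmann_dist2(X, Y):
    """Squared geodesic distance delta(X, Y)^2 = d2 - tr(Y^T X X^T Y)."""
    d2 = X.shape[1]
    return d2 - np.trace(Y.T @ X @ X.T @ Y)
```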

Then, $G_{l,l'}$ and $h_l$ defined by Eqs.(7) and (8) can be computed as
$$G_{l,l'} = \frac{1}{n\sigma_l^2\sigma_{l'}^2}\sum_{i=1}^{n}\sum_{j=1}^{d}[F(X_i, C_l)]^{(j)}[F(X_i, C_{l'})]^{(j)}\,\phi_l(X_i)\phi_{l'}(X_i),$$
$$h_l = \frac{1}{n\sigma_l^2}\sum_{i=1}^{n}\sum_{j=1}^{d}\left[(I_{d_1} - X_iX_i^\top)\frac{\partial F(X_i, C_l)}{\partial X^{(j)}}\right]^{(j)}\phi_l(X_i) + \frac{1}{n\sigma_l^4}\sum_{i=1}^{n}\sum_{j=1}^{d}\left[[F(X_i, C_l)]^{(j)}\right]^2\phi_l(X_i), \qquad (9)$$
where $[\,\cdot\,]^{(j)}$ denotes the $j$-th element of the vectorization of the matrix, $d = d_1 d_2$, and $F(X, C) = (I_{d_1} - XX^\top)CC^\top X$. Details of the derivation of Eq.(9) are deferred to the supplementary material.

4 Experiments

In this section, we experimentally compare the performance of the proposed R-LSLDG clustering (R-LSLDGC) algorithm with the original mean shift (MS), the Riemannian mean shift (R-MS), the LSLDG clustering (LSLDGC), and spectral clustering (SC) (Ng et al., 2002) in terms of the adjusted Rand index (ARI) (Hubert and Arabie, 1985), which takes the maximum value 1 when the obtained clustering solution perfectly matches the ground truth.

SC requires the number of clusters to be fixed in advance. For this reason, we provide the true number of clusters only to SC. The similarity between samples used in SC is defined as $\exp(-\|x_i - x_j\|^2/(2\tau^2))$, where $\tau$ is the median of $\{\|x_i - x_j\|\}_{i,j=1}^{n}$.
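For reference, the SC baseline and the ARI evaluation described above can be reproduced roughly as follows with scikit-learn; the variable and function names are illustrative and the snippet assumes Euclidean (vectorized) inputs.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import SpectralClustering
from sklearn.metrics import adjusted_rand_score

def spectral_clustering_ari(X, true_labels, n_clusters):
    """Spectral clustering with the median-heuristic Gaussian similarity, scored by ARI."""
    D = squareform(pdist(X))                 # pairwise Euclidean distances ||x_i - x_j||
    tau = np.median(D)                       # median heuristic for the similarity bandwidth
    A = np.exp(-D ** 2 / (2 * tau ** 2))     # similarity exp(-||x_i - x_j||^2 / (2 tau^2))
    labels = SpectralClustering(n_clusters=n_clusters,
                                affinity="precomputed").fit_predict(A)
    return adjusted_rand_score(true_labels, labels)
```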

The number of basis functions in LSLDGC and R-LSLDGC is set at $b = \min(100, n)$. The bandwidth in all algorithms and the regularization parameter in LSLDGC and R-LSLDGC are chosen by 5-fold cross-validation from 8 candidate values in $\{10^{-3}, \ldots, 10\}$ spaced at regular intervals on a logarithmic scale.

All experiments were carried out using a PC equipped with two 2.60GHz Intel® Xeon® E5-2640 v3 CPUs.

Toy Data: Let $\{X_i \in \mathbb{R}^{d_1 \times d_2}\}_{i=1}^{n}$ be samples containing 3 clusters on a Grassmann manifold. For $d_1 = 3, \ldots, 7$, $d_2 = 2$, and $n = 150$, each sample $X_i$ is generated as

$$X_i = \begin{pmatrix} \begin{pmatrix} \cos\tau_i & -\sin\tau_i \\ \sin\tau_i & \cos\tau_i \end{pmatrix} & O_{2,d_1-2} \\ O_{d_1-2,2} & I_{d_1-2} \end{pmatrix} S \begin{pmatrix} \cos\eta_i & -\sin\eta_i \\ \sin\eta_i & \cos\eta_i \end{pmatrix}, \qquad (10)$$

where $S$ is a randomly generated element on a Grassmann manifold $\mathcal{G}_{d_1,d_2}$ and $O_{d,d'}$ is the $d \times d'$ null matrix. For $N(\mu, \sigma^2)$ being the normal distribution with mean $\mu$ and variance $\sigma^2$, $\tau_i$ and $\eta_i$ are generated as

$$\tau_i \sim \begin{cases} N\!\left(0, \frac{\pi^2}{15^2}\right) & \text{for } i = 1, \ldots, \frac{n}{3}, \\[4pt] N\!\left(\frac{2\pi}{3}, \frac{\pi^2}{15^2}\right) & \text{for } i = \frac{n}{3}+1, \ldots, \frac{2n}{3}, \\[4pt] N\!\left(\frac{4\pi}{3}, \frac{\pi^2}{15^2}\right) & \text{for } i = \frac{2n}{3}+1, \ldots, n, \end{cases} \qquad \eta_i \sim N(0, \gamma^2),$$

where $\gamma = 0, \pi/8, \pi/4, \pi/2$ controls the variance of the angle of the right rotation matrix. Note that when $\gamma$ is zero, the right rotation matrix is reduced to the identity matrix. In Eq.(10), the left rotation matrix rotates the subspace for generating 3 clusters, while the right rotation matrix rotates within the subspace. As plotted in Figure 3, a larger $\gamma$ collapses the cluster structure in the Euclidean space but not on the Grassmann manifold.
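A sketch of the toy data generation in Eq.(10) is given below; drawing the random subspace S by QR-orthonormalizing a Gaussian matrix is one convenient choice and is an assumption, not necessarily the authors' exact procedure.

```python
import numpy as np

def generate_toy_grassmann(n=150, d1=3, gamma=np.pi / 4, rng=np.random.default_rng(0)):
    """Generate n samples on G_{d1,2} forming 3 clusters, following Eq.(10)."""
    d2 = 2
    S = np.linalg.qr(rng.standard_normal((d1, d2)))[0]   # random element of G_{d1,d2}
    means = [0.0, 2 * np.pi / 3, 4 * np.pi / 3]
    X, labels = [], []
    for c, mu in enumerate(means):
        for _ in range(n // 3):
            tau = rng.normal(mu, np.pi / 15)              # cluster-dependent subspace rotation
            eta = rng.normal(0.0, gamma)                  # rotation within the subspace
            L = np.eye(d1)
            L[:2, :2] = [[np.cos(tau), -np.sin(tau)], [np.sin(tau), np.cos(tau)]]
            R = np.array([[np.cos(eta), -np.sin(eta)], [np.sin(eta), np.cos(eta)]])
            X.append(L @ S @ R)
            labels.append(c)
    return X, np.array(labels)
```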

The ARI values are summarized in Table 1, showing that R-LSLDGC significantly outperforms other methods when γ = π/4, π/2, while SC tends to perform very well when γ = 0, π/8. However, we should note that this comparison is not completely fair since SC is provided with the true number of clusters, while other methods estimate the number of clusters from data.

Compared with the plain LSLDGC, R-LSLDGC performs better for large γ, thanks to the geometry-aware formulation. On the other hand, the plain LSLDGC tends to outperform R-LSLDGC when γ = 0. This is caused by the difference in modeling: R-LSLDGC adopts the common-parameter formulation and thus only a single model is used for jointly estimating the gradient of all dimensions. In contrast, the plain LSLDGC adopts the coordinate-wise formulation, i.e., the gradient along each dimension is estimated separately. Due to this high flexibility, when γ = 0 (i.e., no manifold distortion is introduced), the plain LSLDGC sometimes performs better than R-LSLDGC.

R-MS tends to outperform the plain MS and plain LSLDGC when γ = π/2, thanks to the geometry-aware formulation. However, its performance degrades as the dimension d1 increases. In contrast, R-LSLDGC performs reliably even for large d1, which substantiates the usefulness of direct gradient estimation without going through kernel density estimation on Riemannian manifolds.

Figure 3: Distance matrices with the Euclidean and Riemannian distances when d1 = 3, d2 = 2, and n = 150, for γ = 0, π/8, π/4, π/2. A larger γ collapses the cluster structure on the Euclidean space, while γ does not influence the cluster structure on the Grassmann manifold.

Figure 4 plots the computation time of each method for different d1 averaged over γ = 0, π/8, π/4, π/2. The graph shows that SC, MS, and LSLDGC are quite fast, taking less than a few seconds. On the other hand, R-MS and R-LSLDGC take around 10 seconds, due to the relatively heavy calculation of the logarithm map. This is the price we have to pay for better accuracy.

Image Clustering: The MNIST data set (Lecun et al., 1998) contains handwritten digit images of 0, . . . , 9. The images are down-sampled to 7 × 7 pixels from their original size of 28 × 28. Partially following the experimental setup in Wang et al. (2014), 20 images are drawn from one digit class, and these images are vectorized and concatenated to form a matrix of size 7² × 20 = 49 × 20. Singular value decomposition (SVD) Z = UΣV⊤ is then applied to the matrix, and the top 3 left singular vectors are used as a sample on the Grassmann manifold G49,3. Three digit classes are systematically selected to make the 3-cluster data sets of size n = 150, and we draw 50 samples from each digit class.
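The mapping from a set of digit images to a point on the Grassmann manifold G_{49,3} can be sketched as follows; `images` is assumed to be an array holding the 20 down-sampled 7 × 7 images of one digit class, and the function name is illustrative.

```python
import numpy as np

def images_to_grassmann_point(images):
    """Map 20 down-sampled 7x7 images of one digit class to a point on G_{49,3}.

    images: array of shape (20, 7, 7)."""
    Z = images.reshape(len(images), -1).T        # 49 x 20 matrix of vectorized images
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :3]                              # top-3 left singular vectors: a point on G_{49,3}
```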

The ARI values are reported in Table 2, showing that the proposed R-LSLDGC outperforms other methods.

Table 1: The average and standard error of clustering accuracy measured by ARI (larger is better) over 20 runs for toy data with d2 = 2. Bold face denotes the best and comparable methods in terms of the mean ARI according to the t-test at the significance level 5%.

γ      d1   MS          LSLDGC      SC           R-MS        R-LSLDGC
0      3    .12 (.01)   .82 (.03)   1.00 (.00)   .51 (.06)   .81 (.04)
       4    .12 (.01)   .88 (.03)    .94 (.04)   .28 (.05)   .78 (.04)
       5    .12 (.01)   .89 (.02)   1.00 (.00)   .17 (.01)   .77 (.05)
       6    .12 (.01)   .90 (.03)   1.00 (.00)   .22 (.03)   .81 (.03)
       7    .09 (.01)   .92 (.02)    .97 (.03)   .18 (.01)   .85 (.03)
π/8    3    .10 (.01)   .62 (.03)    .96 (.01)   .51 (.06)   .81 (.04)
       4    .08 (.00)   .69 (.04)    .95 (.03)   .28 (.05)   .78 (.04)
       5    .08 (.01)   .67 (.05)    .85 (.06)   .17 (.01)   .77 (.05)
       6    .09 (.01)   .61 (.05)    .92 (.05)   .22 (.03)   .81 (.03)
       7    .09 (.01)   .66 (.04)    .87 (.05)   .18 (.01)   .85 (.03)
π/4    3    .19 (.04)   .34 (.03)    .50 (.04)   .51 (.06)   .81 (.04)
       4    .13 (.03)   .36 (.02)    .56 (.04)   .28 (.05)   .78 (.04)
       5    .05 (.02)   .37 (.03)    .49 (.05)   .17 (.01)   .77 (.05)
       6    .04 (.00)   .31 (.03)    .35 (.04)   .22 (.03)   .81 (.03)
       7    .04 (.00)   .36 (.03)    .44 (.04)   .18 (.01)   .85 (.03)
π/2    3    .13 (.02)   .21 (.02)    .09 (.01)   .51 (.06)   .81 (.04)
       4    .10 (.02)   .21 (.02)    .08 (.01)   .28 (.05)   .78 (.04)
       5    .02 (.00)   .15 (.02)    .09 (.01)   .17 (.01)   .77 (.05)
       6    .02 (.00)   .14 (.03)    .11 (.01)   .22 (.03)   .81 (.03)
       7    .02 (.00)   .17 (.02)    .09 (.01)   .18 (.01)   .85 (.03)

Figure 4: The average computation time of each method over all γ on toy data.

Figure 5 plots the average computation time over 20 runs. R-LSLDGC takes more than 100 seconds, but its computation time is still comparable to R-MS.

Motion Segmentation: The Hopkins 155 data set (Tron and Vidal, 2007) contains feature vectors automatically extracted from motion sequences of frame length F (see examples of the sequences in Figure 6). Under the planar scenes assumption, trajectories {a_i ∈ R^{2F}}_{i=1}^{n} from the same motion lie on a 3-dimensional subspace of R^{2F}, i.e., the Grassmann manifold G_{2F,3} (Kanatani, 2002; Subbarao and Meer, 2009). We draw 3 trajectories a1, a2, and a3 from the same motion, and then choose the top 3 left singular vectors from the matrix [a1, a2, a3] by applying SVD. Then we create data sets of size n = 100 or n = 150 by drawing 50 samples from each motion.

The ARI values are summarized in Table 3, showing that overall R-LSLDGC achieves higher clustering performance than other methods. Figure 7 plots the average computation time over 20 runs.


Table 2: The average and standard error of clustering accuracy measured by ARI (larger is better) over 20 runs for the MNIST data set. Three clusters, c1 = 0, c2 = 1, and c3 ∈ {2, 3, . . . , 9}, are picked to construct the clustering task. Bold face denotes the best and comparable methods in terms of the mean ARI according to the t-test at the significance level 5%.

c3   MS          LSLDGC      SC          R-MS        R-LSLDGC
2    .00 (.00)   .08 (.02)   .13 (.02)   .12 (.01)   .45 (.02)
3    .00 (.00)   .06 (.02)   .16 (.02)   .10 (.01)   .41 (.02)
4    .00 (.00)   .06 (.02)   .16 (.02)   .12 (.01)   .39 (.02)
5    .00 (.00)   .06 (.02)   .16 (.01)   .08 (.02)   .48 (.02)
6    .00 (.00)   .06 (.02)   .11 (.02)   .03 (.01)   .47 (.02)
7    .00 (.00)   .06 (.02)   .16 (.02)   .00 (.00)   .42 (.02)
8    .00 (.00)   .08 (.02)   .11 (.01)   .12 (.01)   .35 (.02)
9    .00 (.00)   .06 (.02)   .17 (.02)   .05 (.01)   .42 (.02)

Figure 5: The average computation time of each method over 20 runs for the MNIST data set.

Figure 6: Examples of sequences (at time t = 0) from the Hopkins 155 data set: (a) arm (2, 30, 77); (b) articulated (3, 31, 150); (c) cars1 (2, 20, 307); (d) cars3 (3, 20, 548); (e) cars5 (3, 374, 391); (f) cars7 (2, 25, 502); (g) cars9 (3, 61, 220); (h) head (2, 60, 99); (i) people1 (2, 41, 504); (j) three-cars (3, 15, 173); (k) truck1 (2, 30, 188); (l) truck2 (2, 22, 331). The number in parentheses denotes (#Motions, #Frames, #Trajectories).

Table 3: The average and standard error of clustering accuracy measured by ARI (larger is better) over 20 runs on each sequence from the Hopkins 155 data set. The sequence names can be found in Figure 6. Bold face denotes the best and comparable methods in terms of the mean ARI according to the t-test at the significance level 5%.

Seq.   MS          LSLDGC      SC          R-MS        R-LSLDGC
(a)    .05 (.02)   .09 (.01)   .12 (.02)   .26 (.02)   .65 (.04)
(b)    .03 (.02)   .38 (.01)   .15 (.02)   .76 (.01)   .81 (.01)
(c)    .18 (.03)   .08 (.01)   .36 (.04)   .70 (.02)   .75 (.02)
(d)    .21 (.04)   .27 (.02)   .36 (.04)   .72 (.01)   .80 (.01)
(e)    .18 (.03)   .03 (.02)   .25 (.03)   .42 (.01)   .45 (.01)
(f)    .12 (.02)   .03 (.01)   .28 (.03)   .51 (.01)   .45 (.01)
(g)    .08 (.02)   .06 (.01)   .16 (.02)   .63 (.01)   .70 (.01)
(h)    .02 (.01)   .01 (.00)   .03 (.01)   .01 (.00)   .36 (.02)
(i)    .43 (.04)   .34 (.02)   .48 (.03)   .50 (.04)   .59 (.01)
(j)    .11 (.02)   .04 (.01)   .28 (.04)   .65 (.05)   .76 (.01)
(k)    .06 (.02)   .02 (.01)   .28 (.03)   .76 (.04)   .83 (.02)
(l)    .04 (.01)   .06 (.02)   .28 (.03)   .59 (.02)   .64 (.01)

Figure 7: The average computation time of each method over 20 runs on each sequence from the Hopkins 155 data set. The sequence names can be found in Figure 6.

On the whole, R-LSLDGC is not a computationally efficient method, but its computation time is still comparable to R-MS and it performs much better than R-MS.

5 Conclusions

Mean shift is a promising approach to mode-seeking clustering. In this paper, we extended the mean shift clustering algorithm so that Riemannian generalization and direct gradient estimation are both incorporated. Through experiments on Grassmann manifolds, we demonstrated the usefulness of the proposed method. In our future work, we will test the proposed method on other Riemannian manifolds such as Lie groups, the Stiefel manifold, and symmetric positive definite matrices. We will also investigate a computationally efficient approximation scheme for speedup.

Acknowledgements

We thank Prof. Kazumasa Kuwada and Mr. Ikko Yamane for fruitful discussion. MA was supported by KAKENHI 26280054, HS was supported by KAKENHI 15H06103, TS was supported by KAKENHI 15J09111, and MS was supported by KAKENHI 26280054.


References

B. Clarke, E. Fokoue, and H. H. Zhang. Principles and Theory for Data Mining and Machine Learning. Springer, 2009.

K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

K. Fukunaga and L. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Transactions on Information Theory, 21(1):32–40, 1975.

Y. Cheng. Mean shift, mode seeking, and clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(8):790–799, 1995.

D. Comaniciu and P. Meer. Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5):603–619, 2002.

M. A. Carreira-Perpinan. A review of mean-shift algorithms for clustering. Technical Report 1503.00687, arXiv, 2015.

J. Wang, B. Thiesson, Y. Xu, and M. Cohen. Image and video segmentation by anisotropic kernel mean shift. In European Conference on Computer Vision, pages 238–249. Springer, 2004.

W. Tao, H. Jin, and Y. Zhang. Color image segmentation based on mean shift and normalized cuts. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 37(5):1382–1389, 2007.

D. Comaniciu, V. Ramesh, and P. Meer. Real-time tracking of non-rigid objects using mean shift. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 142–149, 2000.

R. T. Collins. Mean-shift blob tracking through scale space. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 2, pages 234–240, 2003.

W. M. Boothby. An Introduction to Differentiable Manifolds and Riemannian Geometry. Gulf Professional Publishing, 2003.

O. Tuzel, R. Subbarao, and P. Meer. Simultaneous multiple 3D motion estimation via mode finding on Lie groups. In Tenth IEEE International Conference on Computer Vision, volume 1, pages 18–25, 2005.

R. Subbarao and P. Meer. Nonlinear mean shift for clustering over analytic manifolds. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition, volume 1, pages 1168–1175, 2006.

R. Subbarao and P. Meer. Nonlinear mean shift over Riemannian manifolds. International Journal of Computer Vision, 84(1):1–20, 2009.

H. E. Cetingul and R. Vidal. Intrinsic mean shift for clustering on Stiefel and Grassmann manifolds. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1896–1902, 2009.

D. D. Cox. A penalty method for nonparametric estimation of the logarithmic derivative of a density function. Annals of the Institute of Statistical Mathematics, 37(1):271–288, 1985.

H. Sasaki, A. Hyvarinen, and M. Sugiyama. Clustering via mode seeking by direct estimation of the gradient of a log-density. In Machine Learning and Knowledge Discovery in Databases Part III - European Conference, pages 19–34, 2014.

B. Pelletier. Kernel density estimation on Riemannian manifolds. Statistics & Probability Letters, 73(3):297–304, 2005.

G. Arvanitidis, L. K. Hansen, and S. Hauberg. A locally adaptive normal distribution. In Advances in Neural Information Processing Systems 29, pages 4251–4259, 2016.

P. Petersen. Riemannian Geometry. Springer New York Inc., New York, NY, USA, 2nd edition, 2006.

J. M. Lee. Introduction to Smooth Manifolds. Springer New York Inc., New York, NY, USA, 2012.

I. Yamane, H. Sasaki, and M. Sugiyama. Regularized multitask learning for multidimensional log-density gradient estimation. Neural Computation, 28(7):1388–1410, 2016.

D. Comaniciu, V. Ramesh, and P. Meer. The variable bandwidth mean shift and data-driven scale selection. In Proceedings of Eighth IEEE International Conference on Computer Vision, volume 1, pages 438–445, 2001.

A. Edelman, T. A. Arias, and S. T. Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998.

A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14, pages 849–856, 2002.

L. Hubert and P. Arabie. Comparing partitions. Journal of Classification, 2(1):193–218, 1985.

Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

B. Wang, Y. Hu, J. Gao, Y. Sun, and B. Yin. Low rank representation on Grassmann manifolds. In Proceedings of 12th Asian Conference on Computer Vision, pages 81–96, 2014.


R. Tron and R. Vidal. A benchmark for the comparison of 3-D motion segmentation algorithms. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–8, 2007.

K. Kanatani. Evaluation and selection of models for motion segmentation. In Proceedings of the European Conference on Computer Vision, pages 335–349. Springer, 2002.