
Adv Data Anal Classif (2013) 7:147–179
DOI 10.1007/s11634-013-0134-6

REGULAR ARTICLE

Fuzzy spectral clustering by PCCA+: application to Markov state models and data classification

Susanna Röblitz · Marcus Weber

Received: 30 August 2012 / Revised: 13 May 2013 / Accepted: 16 May 2013 / Published online: 31 May 2013
© Springer-Verlag Berlin Heidelberg 2013

Abstract Given a row-stochastic matrix describing pairwise similarities between data objects, spectral clustering makes use of the eigenvectors of this matrix to perform dimensionality reduction for clustering in fewer dimensions. One example from this class of algorithms is the Robust Perron Cluster Analysis (PCCA+), which delivers a fuzzy clustering. Originally developed for clustering the state space of Markov chains, the method became popular as a versatile tool for general data classification problems. The robustness of PCCA+, however, cannot be explained by previous perturbation results, because the matrices in typical applications do not comply with the two main requirements: reversibility and nearly decomposability. We therefore demonstrate in this paper that PCCA+ always delivers an optimal fuzzy clustering for nearly uncoupled, not necessarily reversible, Markov chains with transition states.

Keywords Perron eigenvalues · Perturbation theory · Molecular simulations

Mathematics Subject Classification (2000) 60J10 · 62H30 · 65K05

1 Introduction

PCCA+ stands for "Robust Perron Cluster Cluster Analysis". The method got its name to account for the fact that it is a clustering method based on a Perron eigenvalue cluster,¹ and in order to distinguish it from Principal Component Analysis (PCA). In addition, the plus sign is used to distinguish PCCA+ from its predecessor PCCA

¹ The Perron eigenvalue is the unique largest real eigenvalue of a real square matrix with positive entries.

S. Röblitz (B) · M. Weber
Zuse Institute Berlin (ZIB), Takustraße 7, 14195 Berlin, Germany
e-mail: [email protected]


(Perron Cluster Cluster Analysis), which does not deliver a fuzzy ("soft") clustering but a "hard" allocation of data objects. The acronym PCCA+ has been established over the past 10 years, though the name has been shortened to "Robust Perron Cluster Analysis".

Robust Perron Cluster Analysis (PCCA+) simply takes a row-stochastic matrix as input, without caring about the origin of the data. The stochastic matrix describes pairwise similarities between data objects and can be interpreted as the transition probability matrix of a discrete-time Markov chain, representing a random walk on a graph. Several scenarios for the origin of the data are possible. For example, the data objects could represent a trajectory of some stochastic differential equation, or they could have been generated without any notion of time, approximating some submanifold or being randomly sampled from some probability distribution. These two cases have been considered in detail in the framework of diffusion maps by Belkin and Niyogi (2003), Nadler et al. (2005, 2006), Coifman and Lafon (2006). Diffusion maps are defined as the embedding of high-dimensional data onto a low-dimensional Euclidean space via the eigenvectors of suitably defined random walks on the given datasets. The choice of the diffusion map depends on the task at hand and will not be discussed here. The examples on data classification presented in this paper will be based on the classical graph Laplacian normalization, though the use of a non-Euclidean distance will be considered. In addition, we present a different scenario for the origin of the data, namely as states of a Markov state model resulting from the discretization of a continuum transfer operator.

Despite the existence of well-separated clusters, Markov chains resulting from Markov state modelling usually contain transition states, which cannot be assigned uniquely to a cluster. In many applications, e.g. peptide folding simulations with explicit water by Fackeldey et al. (2013), these transition states have a relatively high statistical weight. This leads to a large deviation of the corresponding transition probability matrix from an idealized block-structure, a desirable feature for spectral clustering, see e.g. Fischer and Poland (2005). In other words, the corresponding Markov chain is no longer nearly decomposable. In addition, due to a truncation error resulting from finite sampling in the discretization of the continuous molecular process, the stochastic matrices are usually not generalized symmetric (adjoint to a symmetric matrix), i.e. the Markov chain is not reversible, thus violating a further assumption required in the original PCCA algorithm.

To overcome the problem of transition states, Weber and Galliat (2002), Deuflhard and Weber (2005) developed the PCCA+ by introducing the concept of fuzzy clustering into PCCA, i.e. every object is assigned to all clusters with certain probabilities. Even though reversibility and nearly decomposability have been two main requirements in the theoretical justification of PCCA+ by Deuflhard and Weber (2005), the method has reliably been applied to solve general clustering problems that do not satisfy these assumptions, e.g. in Röblitz and Weber (2009), Metzner et al. (2010), Fackeldey et al. (2013). This paper therefore aims at explaining the theoretical justification for this generalization, as well as illustrating the approach by selected examples from data classification and Markov state modelling. The theory is based on the fact that many stochastic matrices can be considered as small perturbations of transition probability matrices of special Markov chains, which we introduce as uncoupled Markov chains


with transient states. For this ideal case, we show that there exists a unique linear transformation of the eigenvectors of such matrices to membership vectors representing a fuzzy clustering. In the general case of nearly uncoupled Markov chains with transition states, the solution is no longer unique, but we show that PCCA+ delivers a fuzzy clustering that satisfies some optimality criterion.

The optimality criterion is motivated by the fact that the fuzzy clustering obtained by PCCA+ can be used for a coarse-graining of the Markov chain that exactly preserves the slow time-scales. This is the major advantage of PCCA+ compared to other (fuzzy) clustering methods. Unfortunately, the coarse process can in general no longer be considered as a Markov chain. Nevertheless, one can still define a coarse transition matrix, possibly with negative entries, representing the correct propagator on the coarse state space, as demonstrated by Kube and Weber (2007). This new definition resulted in a modification of the objective function for clustering. Instead of maximizing the sum of self-transitions between clusters, PCCA+ aims at making the clusters as sharp (or crisp) as possible, thus trying to avoid negative entries in the coarse-grained transition matrix. Though this idea had briefly been explained in the doctoral thesis by Röblitz (2008), it seems not to have become generally known to the users of PCCA+ and will therefore be reviewed in this paper.

The article is organized as follows. First, we summarize the basics of spectral clustering in Sect. 2. In Sect. 3, we briefly review the original perturbation theory for PCCA based on nearly uncoupled Markov chains, before we introduce the concept of transition states and the theory for PCCA+ in Sect. 4. Finally, in Sects. 5 and 6 we demonstrate how a set of data objects or molecular simulations can be processed to compute transition probability matrices from which clusters can be identified by PCCA+.

2 Spectral clustering

Clustering deals with the problem of separating data objects o_1, …, o_N into different clusters according to their similarities s_ij > 0, i, j = 1, …, N. A partition of N objects into n_c clusters C_1, …, C_{n_c} can be represented by an indicator matrix χ = [χ_1, …, χ_{n_c}] ∈ R^{N×n_c} with

    χ_j(i) = 1 if o_i ∈ C_j, and χ_j(i) = 0 else.

The idea of fuzzy clustering is to perform a relaxation by discarding the condition on the discrete values for χ and instead allow χ to take values in the interval [0, 1] such that

    0 ≤ χ_j(i) ≤ 1,   ∑_{j=1}^{n_c} χ_j(i) = 1   ∀ i = 1, …, N.

The entry χ_j(i) can be interpreted as the membership value of object i with respect to cluster j. The matrix χ is therefore denoted as the membership matrix.


The starting point for spectral clustering is to represent the data in the form of an undirected similarity graph G = (V, E), where the vertices V = {v_1, …, v_N} represent the objects o_i. Each edge between two vertices v_i and v_j carries a weight w_ij ≥ 0, which enters the adjacency matrix W = (w_ij)_{i,j=1,…,N}. The degree matrix D = diag(d_1, …, d_N) is defined as the diagonal matrix with entries

    d_i = ∑_{j=1}^{N} w_ij

on the diagonal. There are several possibilities to obtain the weights w_ij from the similarities s_ij. As long as the number N of objects is of moderate size (about ≤ 2000) such that the computational costs are manageable, the fully connected graph with w_ij = s_ij is preferable, since no information is lost. For large data sets, a sparse similarity graph may be constructed, for example the ε-neighborhood graph or k-nearest neighbor graphs.

The concept of neighborhood is usually based on pairwise distances d_ij (commonly, but not necessarily, the Euclidean distance) instead of similarities s_ij. Once the graph has been constructed, the distances have to be transformed to similarities. A popular transformation is the Gaussian similarity function

    s_ij = exp(−β d_ij²),   β = 1/(2σ²).   (1)

Here, the parameter σ controls the width of the neighborhoods. Since its choice is crucial for the clustering result, alternative similarity measures with less sensitive parameters have been developed, e.g. by Zhao et al. (2011).
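As a concrete illustration, the similarity function (1) can be evaluated for a whole distance matrix at once. The following sketch uses NumPy; the function name and the toy distance matrix are our own choices, not part of the paper:

```python
import numpy as np

def gaussian_similarity(dist, sigma):
    """Gaussian similarity function (1): s_ij = exp(-beta * d_ij^2), beta = 1/(2 sigma^2)."""
    beta = 1.0 / (2.0 * sigma ** 2)
    return np.exp(-beta * dist ** 2)

# toy pairwise distances between three objects
dist = np.array([[0.0, 1.0, 4.0],
                 [1.0, 0.0, 3.0],
                 [4.0, 3.0, 0.0]])
S = gaussian_similarity(dist, sigma=1.0)
# S is symmetric, has ones on the diagonal, and decays with distance
```

A small σ makes each object "see" only its closest neighbors, while a large σ blurs the cluster structure, which is why the choice of σ is crucial.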

Clustering is now equivalent to finding a partition of the graph such that edges between different clusters have a low weight and edges within a cluster have a high weight. Spectral clustering is based on the fact that the number of connected components A_1, …, A_{n_c} ⊂ V of a graph G is equal to the multiplicity of the eigenvalue zero of the graph Laplacian

    L = I − D⁻¹W.   (2)

The corresponding eigenspace is spanned by the characteristic vectors 1_{A_1}, …, 1_{A_{n_c}} ∈ {0, 1}^N, whereby 1 denotes the vector with all components equal to 1 and

    1_{A_j}(i) = 1 if v_i ∈ A_j, and 0 else.

The most common spectral clustering algorithms have the following form:

1. Construct a similarity graph with weighted adjacency matrix W.
2. Compute a graph Laplacian L (normalized or unnormalized).
3. Compute the first n_c eigenvectors X = [x_1, …, x_{n_c}] of L.
4. For i = 1, …, N let y_i ∈ R^{n_c} be the i-th row of X. Cluster the points {y_i}_{i=1,…,N} into clusters C_1, …, C_{n_c}.

Spectral clustering requires only the computation of a few eigenvectors, which is quite easy with standard numerical software like Matlab.
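Steps 2–4 can be sketched in a few lines of NumPy. This is a minimal illustration with the random-walk normalization L = I − D⁻¹W, so the leading eigenvectors of P = D⁻¹W are computed directly; the toy weight matrix with two weakly coupled groups is our own example:

```python
import numpy as np

def spectral_embedding(W, n_clusters):
    """Steps 2-4: random-walk matrix P = D^{-1} W (so L = I - P), leading eigenvectors."""
    d = W.sum(axis=1)
    P = W / d[:, None]                 # transition matrix of the random walk
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-evals.real)    # eigenvalues of P near 1 <=> eigenvalues of L near 0
    return evecs[:, order[:n_clusters]].real   # row i is the point y_i

# two weakly coupled groups on the graph: {0, 1, 2} and {3, 4}
W = np.array([[0.0,  1.0, 1.0,  0.01, 0.0],
              [1.0,  0.0, 1.0,  0.0,  0.0],
              [1.0,  1.0, 0.0,  0.0,  0.01],
              [0.01, 0.0, 0.0,  0.0,  1.0],
              [0.0,  0.0, 0.01, 1.0,  0.0]])
Y = spectral_embedding(W, 2)
# the rows y_i are nearly constant within each group and separate the groups
```

The final clustering of the rows of Y (step 4) would then be done by k-means or, as discussed below, by PCCA+.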

Different spectral clustering algorithms differ in the computation of the graph Laplacian, a possible normalization of the rows of X, and the clustering of these rows. For example, in normalized spectral clustering according to Shi and Malik (2000), the Laplacian is computed as in (2), but other choices are possible. In particular, Nadler et al. (2006), Coifman and Lafon (2006) have shown that different normalizations correspond to approximations of different continuum operators. For properties of graph Laplacians, the reader is referred to von Luxburg (2007).

The clustering of points {y_i}_{i=1,…,N} in step 4 is usually done by the k-means algorithm or fuzzy k-means as in Bezdek et al. (1984), Jimenez (2008), but any other method could be used instead. One possible choice is PCCA+. In contrast to other clustering methods, e.g. (fuzzy) k-means, PCCA+ makes use of the special structure of the rows y_i (they form a simplex), which makes the cluster result independent of any initialization step. The simplex structure would not occur if singular vectors instead of eigenvectors were used for clustering, as for example in Kannan et al. (2004).

The matrix P = D⁻¹W is a row-stochastic matrix and can be interpreted as the transition matrix of a random walk which jumps from vertex to vertex. The transition probability of jumping in one step from vertex i to vertex j is given by p_ij = w_ij/d_i. If the graph is connected and non-bipartite, then the random walk possesses a unique stationary distribution w = (w_1, …, w_N)^T ∈ R^N given by w_i = d_i/∑_j d_j. Spectral clustering corresponds to finding a partition of the graph such that the random walk stays long within the same cluster and seldom jumps between clusters. Since P and L = I − P have the same eigenvectors, spectral clustering on L is equivalent to spectral clustering on P. In contrast to other (fuzzy) clustering methods, the clustering in PCCA+ is obtained as a linear transformation of the eigenvectors, which preserves the slow time-scales of the random walk as shown in Kube and Weber (2007), Weber (2013).
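The closed-form stationary distribution w_i = d_i/∑_j d_j is easy to verify numerically on a toy graph (the weight matrix below is our own example; the self-loop at the third vertex keeps the graph non-bipartite):

```python
import numpy as np

# connected, non-bipartite toy graph (the self-loop at vertex 3 creates an odd cycle)
W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
d = W.sum(axis=1)
P = W / d[:, None]            # p_ij = w_ij / d_i
w = d / d.sum()               # claimed stationary distribution w_i = d_i / sum_j d_j
assert np.allclose(w @ P, w)  # w^T P = w^T
assert np.isclose(w.sum(), 1.0)
```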

3 Perron Cluster Analysis (PCCA)

Perron Cluster Analysis (PCCA) was originally developed by Deuflhard et al. (2000) as a tool to lump together dynamically related states of a nearly uncoupled (or nearly completely decomposable) Markov chain from a given transition probability matrix. To understand the algorithm, it is instructive to first consider its behavior in the "ideal" case in which the matrix P is block-diagonal. Afterwards, its behavior in case of perturbations is considered.

3.1 The ideal case: uncoupled Markov chains

The transition probability matrix P = D⁻¹W represents a Markov chain on the state space S = {o_1, …, o_N}. In case of a decomposable Markov chain or, equivalently, a disconnected similarity graph, an appropriate permutation of objects according to their connectedness results in a block-diagonal matrix P with n_c blocks, see Fig. 1.

[Fig. 1: Structure of an unperturbed transition matrix with n_c stable clusters: a block-diagonal matrix with blocks P_11, …, P_{n_c n_c}.]

This matrix has an n_c-fold eigenvalue λ = 1, called the Perron root, where n_c corresponds to the number of decoupled Markov chains. The coordinates of the corresponding right eigenvectors x_1, …, x_{n_c} ∈ R^N, called Perron eigenvectors, are piecewise constant on the blocks and can thus be used to identify the clusters as described by Deuflhard et al. (2000). They can be written as linear combinations of the characteristic vectors χ_k = (χ_k(1), …, χ_k(N))^T ∈ R^N, k = 1, …, n_c, of the connected components,

    x_j = ∑_{k=1}^{n_c} a_kj χ_k,

where

    χ_k(i) = 1 if state i belongs to cluster k, and 0 else.

By compiling the eigenvectors into a matrix X = [x_1, …, x_{n_c}] ∈ R^{N×n_c}, the characteristic vectors into a matrix χ = [χ_1, …, χ_{n_c}] ∈ R^{N×n_c}, and introducing a non-singular matrix A⁻¹ = (a_ij) ∈ R^{n_c×n_c}, one can write

    χ = X A.

Denote by S_i ⊂ {1, …, N}, i = 1, …, n_c, the subset of object indices belonging to cluster (or graph component) i. Since the Perron eigenvectors are constant on each index subset S_i, each subset may be represented by just one index l_i ∈ S_i, leading to

    A⁻¹ = ( x_1(l_1)      …  x_{n_c}(l_1)     )
          (    ⋮                 ⋮            )
          ( x_1(l_{n_c})  …  x_{n_c}(l_{n_c}) ).

Hence, in the ideal case the characteristic vectors of the connected components can directly be obtained as a linear transformation of the right eigenvectors.
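For a small block-diagonal P this transformation can be carried out explicitly. The sketch below (the toy matrix is our own) selects one representative index l_i per block to build A⁻¹ and recovers the characteristic vectors via χ = XA:

```python
import numpy as np

# uncoupled chain: two row-stochastic blocks on the diagonal (our own toy example)
P = np.zeros((5, 5))
P[:3, :3] = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.3, 0.3, 0.4]]
P[3:, 3:] = [[0.5, 0.5], [0.4, 0.6]]

evals, evecs = np.linalg.eig(P)
X = evecs[:, np.isclose(evals, 1.0)].real   # basis of the Perron eigenspace (lambda = 1)
A_inv = X[[0, 3], :]                        # one representative row l_i per block
chi = X @ np.linalg.inv(A_inv)              # chi = X A
# chi reproduces the characteristic vectors of the two blocks
assert np.allclose(chi[:3, 0], 1) and np.allclose(chi[3:, 0], 0)
```

Note that any basis of the Perron eigenspace works here, since every vector in that space is piecewise constant on the blocks.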


3.2 The general case: nearly uncoupled Markov chains

Generally, the matrix P constructed from practical data is not decomposable. Instead, there usually exist n_c nearly uncoupled Markov chains, each of which is running for a long time in one of the subsets S_i, which are therefore called almost invariant subsets. Such Markov chains are also known as nearly completely decomposable or nearly reducible Markov chains, see Courtois (1977), Stewart (1984), Meyer (1989). In this case, a block diagonally dominant transition probability matrix P will arise after a suitable permutation of the states. However, the perturbation theory developed by Deuflhard et al. (2000) shows that if, under some additional conditions, the matrix P is an O(ε)-perturbation of a block-diagonal matrix P̃,

    P = P̃ + ε P⁽¹⁾ + O(ε²),   P⁽¹⁾ ∈ R^{N×N},

then the transition matrix P will have a Perron cluster of eigenvalues

    λ_1 = 1,   λ_2 = 1 − O(ε), …, λ_{n_c} = 1 − O(ε),

where ε > 0 denotes some perturbation parameter, which is scaled as

    ε = 1 − λ_2.

Moreover, there will be a matrix of corresponding eigenvectors X = [x_1, …, x_{n_c}] ∈ R^{N×n_c} of the form

    x_i = x̃_i + ε x_i⁽¹⁾ + O(ε²),   x_i⁽¹⁾ ∈ R^N.

In Deuflhard et al. (2000), the result

    x_i = ∑_{j=1}^{n_c} α_ij χ_j + ε ∑_{j=n_c+1}^{N} 1/(1 − λ_j) Π_j P⁽¹⁾ x_i + O(ε²)   (3)

has been obtained, where the second (order-ε) term is abbreviated as R_i and Π_j denotes the orthogonal projection onto the eigenspace of P̃ corresponding to the eigenvalue λ_j, see Kato (1984). Thus, up to perturbations of order ε, the eigenvectors are still almost constant on the strongly connected components.² As in the ideal case, we still call these eigenvectors Perron eigenvectors, and the corresponding subspace the Perron eigenspace.
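The predicted Perron cluster is easy to observe numerically. In the sketch below (the blocks and the coupling are our own toy choices), a block-diagonal P̃ is perturbed into P = (1 − ε)P̃ + εU with a uniform row-stochastic U, which keeps P row-stochastic:

```python
import numpy as np

# nearly uncoupled chain: block-diagonal P0 plus an O(eps) coupling (toy example)
P0 = np.zeros((6, 6))
P0[:3, :3] = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.3, 0.3, 0.4]]
P0[3:, 3:] = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
eps = 1e-3
U = np.full((6, 6), 1.0 / 6.0)       # uniform coupling, row-stochastic
P = (1 - eps) * P0 + eps * U         # still row-stochastic

evals = np.sort(np.linalg.eigvals(P).real)[::-1]
# Perron cluster: n_c = 2 eigenvalues within O(eps) of 1, well separated from the rest
assert evals[0] > 1 - 1e-8
assert evals[1] > 1 - 10 * eps
assert evals[2] < 0.95
```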

Remark 1 A similar perturbation result for spectral clustering of an arbitrary affinity matrix with normalized graph Laplacian was derived by Ng et al. (2002). Later on, von Luxburg (2007) considered the fully connected graph as a perturbation of a disconnected similarity graph and discussed a perturbation result for symmetric matrices based on the Davis–Kahan theorem.

² Unfortunately, the statement that R_i equals zero as given in Deuflhard and Weber (2005) turned out to be wrong, see Kube and Deuflhard (2006). Nevertheless, the result explains the robustness of the eigenvectors under perturbations.

3.3 PCCA approach

The key idea of Deuflhard et al. (2000) was to identify the almost constant level patterns of the eigenvectors by their sign structure. Once χ had been determined in this way, the linear least squares system

    ‖χ − X A‖ = min_A

was solved for the n_c² unknown entries of the matrix A in order to obtain an error indicator measuring the influence of the weak modes on the coupling between the aggregates S_1, …, S_{n_c}. However, the PCCA algorithm suffered from a lack of robustness for the following reasons:

– Due to perturbations of "almost zero" levels, this method often failed in practice.
– The possible number of different sign structures grows exponentially with the number of clusters.
– In many applications the departure from the completely decomposable block-structured transition matrix is too large due to the existence of transition states. This will be explained in detail in Sect. 4.

In addition, the proof of (3) exploits that P is generalized symmetric,

    D² P = P^T D²,   (4)

where D² = diag(w_i) and w = (w_1, …, w_N)^T is the unique stationary distribution of P,

    w^T P = w^T,   ∑_{i=1}^{N} w_i = 1.

Equation (4) is just another formulation of the detailed balance condition,

    w_i P(i, j) = w_j P(j, i)   ∀ i, j = 1, …, N.

A Markov chain satisfies the detailed balance equation if and only if it is a reversible Markov chain.
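Detailed balance can be checked directly by testing whether the "flow" matrix D²P is symmetric; for P = D⁻¹W with symmetric W this always holds. The helper name and the toy graph below are our own:

```python
import numpy as np

def is_reversible(P, w, tol=1e-12):
    """Detailed balance check (Eq. 4): w_i P(i,j) = w_j P(j,i) for all i, j."""
    F = w[:, None] * P            # the matrix D^2 P with D^2 = diag(w)
    return np.allclose(F, F.T, atol=tol)

W = np.array([[0.0, 2.0, 1.0],
              [2.0, 0.0, 3.0],
              [1.0, 3.0, 0.0]])   # symmetric edge weights
d = W.sum(axis=1)
P = W / d[:, None]                # P = D^{-1} W
w = d / d.sum()                   # stationary distribution of the random walk
assert is_reversible(P, w)        # random walks on undirected graphs are reversible
```

Stochastic matrices estimated from finite molecular simulation data generally fail this test, which is exactly the situation addressed in Sect. 4.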

In practice, however, we are often faced with the task to cluster a stochastic matrix that is neither nearly decomposable nor generalized symmetric. Therefore, the improved algorithm PCCA+ and a modified perturbation theory will be presented in Sect. 4.


4 Robust Perron Cluster Analysis (PCCA+)

In many of our applications the departure from the completely decomposable block-structured transition matrix is too large due to the existence of transition states. In molecular dynamics, for example, transition states correspond to entropic or enthalpic energy barriers on the potential energy surface. If they have a significant statistical weight, such transition states destroy the constant level patterns of the eigenvectors X, i.e. they cannot be assigned uniquely to one of the clusters. Therefore, Weber and Galliat (2002), Deuflhard and Weber (2005) developed the more robust variant PCCA+ by exploiting the simplex structure of the eigenvectors for clustering. In case of uncoupled Markov chains, the rows of the Perron eigenvectors can be considered as vertices of an (n_c − 1)-dimensional simplex. Each state can be assigned to one of the n_c vertices and thus to one of the n_c blocks. Under perturbation, the simplex will be disturbed. There will be rows which cannot be assigned uniquely to one of the blocks. Instead, PCCA+ aims at an assignment of states i ∈ {1, …, N} to clusters j ∈ {1, …, n_c} with certain grades of membership χ_j(i) ∈ [0, 1]. A set of vectors {χ_j}_{j=1}^{n_c} with χ_j ∈ R^N are called membership vectors if they meet the following properties:

    χ_j(i) ≥ 0   ∀ i ∈ {1, …, N}, j ∈ {1, …, n_c}   (positivity)   (5)
    ∑_{j=1}^{n_c} χ_j(i) = 1   ∀ i ∈ {1, …, N}   (partition of unity)   (6)

Assume that we have computed the eigenvectors X of P corresponding to eigenvalues close to one,

    P X = X Λ,   Λ = diag(λ_1, …, λ_{n_c}),   1 ≥ λ_i ≥ 1 − ε,   (7)

and that the eigenvectors have been normalized such that

    X^T D² X = I,   (8)

whereby I denotes the identity matrix and D² is the diagonal matrix with the stationary distribution w on its diagonal, D² = diag(w). We continue calling these eigenvectors Perron eigenvectors, and the corresponding subspace the Perron eigenspace, both in the ideal as well as in the general case described below. Let χ = [χ_1, …, χ_{n_c}] ∈ R^{N×n_c} be the matrix compiled of membership vectors. Together with

    χ = X A,   A non-singular   (invariance)   (9)

the conditions (5) and (6) define an (n_c − 1)-simplex σ_{n_c−1} ⊂ R^{n_c} (since x_1 = 1), where all points y_i (the rows of the eigenvector matrix X) are located within the simplex. Then every point y_i can be assigned to one of the n_c vertices and thus to one of the n_c clusters by a membership value (or probability) χ_j(i). Note that indicator vectors are just a special instance of membership vectors. They are also called crisp or hard


[Fig. 2: Structure of an unperturbed transition matrix with n_c stable clusters and a number of transition states, represented by the last columns and rows: diagonal blocks P_11, …, P_{n_c n_c}, the blocks P_tc and P_tt in the last rows, and a zero block in the upper right.]

membership vectors, whereas general membership vectors are also denoted as soft membership vectors or almost characteristic vectors.

Remark 2 Instead of the positivity condition, Weber and Galliat (2002), Weber (2003) had earlier introduced

    ∀ j ∈ {1, …, n_c} ∃ i ∈ {1, …, N} : χ_j(i) = 1   (maximal scaling condition).   (10)

Since this led to results with possibly negative membership values, the condition was later on replaced by the positivity condition in Deuflhard and Weber (2005).

In the following, we will first demonstrate that a linear transformation of Perron eigenvectors to membership vectors always exists for a special class of matrices representing uncoupled Markov chains with transient states (the ideal case). Secondly, we will discuss the general case as a small perturbation of the ideal case.

4.1 The ideal case: uncoupled Markov chains with transient states

Assume that the N objects (or states) to be clustered can be divided into cluster states (c) and transient states (t), and that the similarity graph that contains only the cluster states decomposes into n_c disconnected components. Starting in a transient state, the Markov chain can switch to any other state, but once it reaches a cluster state, it stays in the corresponding component. Upon a suitable permutation of states, the corresponding transition probability matrix P then has the form

    P = ( P_cc   0   )
        ( P_tc  P_tt )   (11)

whereby P_cc is a block-diagonal matrix with n_c blocks as illustrated in Fig. 2.

Now we want to examine the eigenvalues and eigenvectors of such a transition probability matrix with transient states. Assume that the blocks P_jj, j = 1, …, n_c, are primitive and that the corresponding Markov chains are rapidly mixing. That means, each block P_jj gives rise to an eigenvalue λ_j1 = 1 and to eigenvalues λ_jk well-separated away from 1. Thus, the overall matrix P has an n_c-fold eigenvalue λ = 1 and a set of eigenvalues {λ_k}_{k=n_c+1}^{N} with |λ_k| < 1. Of course, the complete


matrix P is not primitive because any power P^m possesses the same structure as P and hence there does not exist any m ∈ N such that P^m > 0 element-wise.

Let us first consider the stationary distribution w, which is defined as a normalized left eigenvector of P corresponding to the eigenvalue 1,

    w^T P = w^T,   w^T 1 = 1.

In fact, there is no unique stationary distribution, but only a unique left invariant subspace Y spanned by the following vectors,

    Y = span{ (w_{C_1}, 0, …, 0, 0)^T, …, (0, …, 0, w_{C_{n_c}}, 0)^T },

where the vectors {w_{C_k}}_{k=1}^{n_c} are the unique stationary distributions of the sub-chains. In other words, the Markov chain has several equilibrium states. Note that the last components of all vectors are zero, which means that in every equilibrium state the probability of being in a transient state is zero.

Similarly, the right eigenvectors according to the Perron root are not unique, in contrast to the corresponding subspace. In the following, we call the subspace spanned by these eigenvectors the Perron subspace. We want to show that there exists a basis of membership vectors for this subspace. Define a matrix χ = [χ_1, …, χ_{n_c}] ∈ R^{N×n_c} and decompose it into two parts,

    χ = ( χ^c )
        ( χ^t ),   χ^c ∈ R^{(N−n_t)×n_c},   χ^t ∈ R^{n_t×n_c}.

For χ to be a basis of the Perron subspace, it must satisfy

    P χ = λ χ,   λ = 1.

The equation can be rewritten in terms of χ^c and χ^t as

    ( P_cc   0   ) ( χ^c )   =   ( χ^c )
    ( P_tc  P_tt ) ( χ^t )       ( χ^t ),

which equals the following system of equations:

    P_cc χ^c = χ^c   and   P_tc χ^c + P_tt χ^t = χ^t.

Denote by I_j, j = 1, …, n_c, the set of indices which form the block P_jj. Define the entries χ^c_j(i), i = 1, …, N − n_t, j = 1, …, n_c, of the upper matrix χ^c as

    χ^c_j(i) = 1 if i ∈ I_j, and 0 else

(recall that the index i runs over the cluster states only). Thus, χ^c not only satisfies P_cc χ^c = χ^c, but also the properties of membership vectors.

The matrix χ^t must satisfy the system of equations

    (I − P_tt) χ^t = P_tc χ^c,

which is equal to the systems

    (I − P_tt) χ^t_j = [ ∑_{k∈I_j} P(t_1, k), …, ∑_{k∈I_j} P(t_{n_t}, k) ]^T,   j = 1, …, n_c,   (12)

whereby {t_1, …, t_{n_t}} denote the indices of the transient states.

To show that this system has a unique solution, we have to show that I − P_tt is non-singular. For this purpose, we can assume that P_tt is irreducible. (Otherwise, the transient states could be reordered such that P_tt becomes block-diagonal with irreducible blocks. Then one could decompose χ^t accordingly and split the systems (12) into further subsystems.) Now the following Lemma applies.

Lemma 1 Let A ≥ 0 be an irreducible matrix with row sums strictly smaller than one, i.e. ∑_j A(i, j) < 1 ∀ i. Then any eigenvalue λ of A is located within the unit disc (|λ| < 1).

Proof The Perron–Frobenius Theorem (Bapat and Raghavan 1997, Thm. 1.4.4) ensures the existence of a positive eigenvector y > 0 corresponding to an eigenvalue λ_0 > 0 that is maximal in modulus among all the eigenvalues of A. It remains to show that λ_0 < 1:

    λ_0 y_i = ∑_j A(i, j) y_j ≤ (max_k y_k) ∑_j A(i, j) < max_k y_k   ∀ i.

The inequality is especially satisfied for i = arg max_k(y_k), i.e. λ_0 y_max < y_max. Thus λ_0 < 1. ∎

The matrix P_tt is in fact non-negative. The assumption that the row sums of P_tt are strictly less than one is reasonable because there are transitions to the clusters, i.e. the row sums of P_tc are greater than zero. Consequently, the eigenvalues of P_tt are within the unit disc and thus I − P_tt is regular (I − P_tt is actually a non-singular M-matrix). Therefore, Eq. (12) has a unique solution. In addition, it can be shown that the solution χ^t consists of membership vectors.
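A small numeric check of Lemma 1 and its consequence (the sub-stochastic block below is our own toy example):

```python
import numpy as np

# irreducible sub-stochastic block P_tt (row sums strictly below one)
Ptt = np.array([[0.3, 0.4],
                [0.5, 0.2]])
assert (Ptt.sum(axis=1) < 1).all()
rho = np.max(np.abs(np.linalg.eigvals(Ptt)))   # spectral radius
assert rho < 1                                 # Lemma 1
# consequence: I - P_tt is a non-singular M-matrix with a non-negative inverse
Minv = np.linalg.inv(np.eye(2) - Ptt)
assert (Minv >= 0).all()
```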

Lemma 2 The solution vectors χ^t_j of (12) are non-negative and satisfy ∑_{j=1}^{n_c} χ^t_j = 1.


Proof Since (I − Ptt ) is a non-singular M-matrix, the inverse (I − Ptt )−1 exists and is

non-negative (Bapat and Rhagavan 1997, Thm. 1.5.2). Moreover, the right hand sidesof (12) are non-negative, such that χt will also be non-negative. Furthermore, set

A ≡ (I − Ptt ), s ≡∑

j

χ tj ,

b j ≡⎡⎣∑

k∈I j

P(t1, k), . . . ,∑k∈I j

P(tnt , k)

⎤⎦

T

.

We have to show that s is the vector of ones, i.e. s = 1. Since P is stochastic, it holds ∑_j b_j = ∑_j A(:, j). Hence

∑_j A(i, j) = ∑_j b_j(i) = ∑_j ∑_k A(i, k) χ_j^t(k) = ∑_k A(i, k) s(k).

Consequently, s satisfies A(s − 1) = 0. Since A is regular, it follows that s = 1. □
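The construction behind Lemma 2 can be illustrated on a toy ideal chain with two absorbing cluster states and two transient states (a Python/NumPy sketch; all matrices are invented for illustration):

```python
import numpy as np

# Toy ideal model: two absorbing cluster states and two transient states.
P_tt = np.array([[0.3, 0.2],
                 [0.1, 0.4]])           # transitions among transient states
P_tc = np.array([[0.4, 0.1],
                 [0.2, 0.3]])           # transitions to clusters 1 and 2

A = np.eye(2) - P_tt
# One linear system (12) per cluster: the columns of chi_t are the
# membership vectors of the transient states.
chi_t = np.linalg.solve(A, P_tc)

print(np.all(chi_t >= 0))                   # non-negative (Lemma 2)
print(np.allclose(chi_t.sum(axis=1), 1.0))  # rows sum to one (partition of unity)
```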

To summarize, the vectors χ_j meet all properties of membership vectors. They can be interpreted in the sense of assigning a state i to a cluster j with a certain probability χ_j(i). Since the subspace is unique, any eigenvector basis X corresponding to the Perron root λ = 1 can be transformed linearly into such membership vectors, i.e. there exists a non-singular transformation matrix A ∈ R^{nc×nc} such that χ = X A. Thus, the rows of any eigenvector matrix X (corresponding to eigenvalues equal to 1) of the introduced model matrix form in fact an (nc − 1)-dimensional simplex, whereby the rows corresponding to cluster states define the vertices of the simplex, and the rows corresponding to transient states are located within the simplex.

4.2 The general case: nearly uncoupled Markov chains with transition states

We now consider the case that transitions between different clusters can occur, either directly or via the previously transient states, which are no longer transient states but are called transition states now. Upon permutation of states, the corresponding transition probability matrix P then has the form

P = P̃ + ( Ecc  Ect ; 0  0 ) = ( Pcc  Pct ; Ptc  Ptt ),

where P̃ denotes the unperturbed model matrix (11) and the semicolon separates block rows.

To ensure the existence of a unique stationary distribution w, we assume that P is a primitive stochastic matrix, see Brémaud (1999), Kijima (1997). Moreover, Pcc is assumed to be of block-diagonally dominant form,


Pcc = ( P11  E12  · · ·  E1nc ; E21  P22  · · ·  E2nc ; . . . ; Enc1  Enc2  · · ·  Pncnc )

with small error matrices Eij such that

Pkk 1 > (1 − εt) 1,  k = 1, . . . , nc,  0 ≤ εt ≪ 1.

This inequality represents the condition for weak coupling between the clusters, i.e. transitions between clusters are rare events. Matrices with the above properties can thus be considered as small perturbations of the simplified model problem introduced in the previous section in Eq. (11).
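A minimal sketch of such a nearly uncoupled chain (illustrative numbers, Python/NumPy) shows the characteristic Perron cluster of eigenvalues near one:

```python
import numpy as np

# Two internally well-mixed 2-state blocks, coupled by a small perturbation.
P0 = np.array([[0.7, 0.3, 0.0, 0.0],
               [0.3, 0.7, 0.0, 0.0],
               [0.0, 0.0, 0.6, 0.4],
               [0.0, 0.0, 0.4, 0.6]])
eps = 1e-3                                   # coupling strength (illustrative)
P = P0 + eps
P = P / P.sum(axis=1, keepdims=True)         # renormalize rows to stay stochastic

lam = np.sort(np.linalg.eigvals(P).real)[::-1]
print(lam)   # two eigenvalues near 1 (the Perron cluster), then a clear gap
```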

Note that, in contrast to Sect. 3, the matrices P and P̃ are no longer assumed to be generalized symmetric. Hence, eigenvectors are no longer analytic with respect to some small perturbation parameter ε, and perturbation results in the form of a series expansion in terms of ε can no longer be derived.

However, the result of PCCA+ is independent of the choice of basis vectors for the subspace X = span(x_1, . . . , x_nc). Thus, it does not matter if single eigenvectors are sensitive to perturbations as long as the invariant subspace is insensitive. In fact, it can be shown by standard matrix perturbation theory as in Stewart and Ji-guang (1990) that under certain conditions the invariant subspace X is insensitive with respect to small perturbations of the matrix P. Though these conditions cannot be checked a priori, they seem to be fulfilled in practice, which explains the robustness of PCCA+, i.e. its applicability in the general case of nearly uncoupled Markov chains with transition states.

Remark 3 A problem arising in the case of irreversible Markov chains is the fact that the eigenvectors might become complex valued, which is undesirable for the application of PCCA+. A possible way to circumvent this problem is to work with the real Schur decomposition of A as in Stewart and Ji-guang (1990), and to apply PCCA+ to the real Schur vectors. This is justified by the fact that the real Schur vectors span the same subspace as the corresponding complex Schur vectors, which again span the same subspace as the corresponding eigenvectors. The Schur decomposition is a target for iterative eigenvalue algorithms such as the Jacobi–Davidson method described by Sleijpen and Vorst (1996) or the implicitly restarted Arnoldi method by Lehoucq and Sorensen (1996).
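Assuming SciPy is available, the construction in Remark 3 can be sketched as follows; the matrix P is an invented small non-reversible example, and in practice one would additionally reorder the Schur form so that the dominant eigenvalues come first:

```python
import numpy as np
from scipy.linalg import schur

# A small non-reversible stochastic matrix (invented numbers).
P = np.array([[0.8, 0.2, 0.0],
              [0.0, 0.7, 0.3],
              [0.3, 0.0, 0.7]])

# Real Schur form P = Q T Q^T: Q is orthogonal, T quasi upper triangular.
T, Q = schur(P, output='real')

# The Schur vectors are real even if the spectrum is complex, and the
# leading ones span the same invariant subspace as the dominant eigenvectors,
# so PCCA+ can be applied to them directly.
print(np.allclose(P @ Q, Q @ T))     # invariance of the Schur basis
print(np.isrealobj(Q))               # real-valued basis vectors
```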

The solution χ constructed in the ideal case is the unique solution that satisfies the conditions on membership vectors (positivity and partition of unity), the invariance condition, as well as the maximal scaling condition (10). This is no longer true in the general case. As demonstrated by Weber (2006), a solution that satisfies all four conditions usually does not exist, and dropping the maximal scaling condition (or any other of the four conditions) leads to a whole class of fuzzy clustering solutions. We will therefore demonstrate in the following how a solution can be obtained that satisfies some optimality criterion.


4.3 PCCA+ approach

Throughout this section, we will drop the tildes, which means that we write X, χ, and A instead of X̃, χ̃, and Ã, unless explicitly stated otherwise.

The goal of PCCA+ is to find a non-singular transformation matrix A ∈ R^{nc×nc} such that

χ = X A

subject to the membership conditions (5) and (6). These conditions can be reformulated in terms of A and X in the form

A(1, j) ≥ −∑_{i=2}^{nc} x_i(l) A(i, j),  ∀ j ∈ {1, . . . , nc}, l ∈ {1, . . . , N}  (positivity),

A(i, 1) = δ_{i,1} − ∑_{j=2}^{nc} A(i, j),  ∀ i ∈ {1, . . . , nc}  (partition of unity),

whereby δ_{i,j} = 1 if i = j, and 0 else.

These conditions characterize the set F_A of feasible transformation matrices. This set has at least the feasible point A*(i, j) = δ_{i,1}/nc and is therefore not empty. Since the constraints for F_A are linear, the set is convex. Furthermore, Weber (2006) shows that one can define a subset F′_A ⊂ F_A that includes all vertices of F_A by the equality constraints

A(1, j) = − min_{l=1,...,N} ∑_{i=2}^{nc} x_i(l) A(i, j),  ∀ j ∈ {1, . . . , nc}  (positivity),

A(i, 1) = δ_{i,1} − ∑_{j=2}^{nc} A(i, j),  ∀ i ∈ {1, . . . , nc}  (partition of unity).

This reduces the dimension of the search space from nc² to (nc − 1)² and allows for an easy construction of feasible transformation matrices.

The transformation matrix A is not uniquely defined by these conditions. In fact, there is an uncountable number of solutions, as shown by Deuflhard and Weber (2005). Therefore, PCCA+ searches for a matrix A that optimizes some objective function.
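This completion step can be sketched as follows (Python/NumPy; the helper name fill_A and the synthetic eigenvector basis X are our own illustrations, not the reference MATLAB implementation). The free (nc − 1)² inner block is filled up via the two equality constraints; the final rescaling enforces the partition-of-unity constraint for the first row:

```python
import numpy as np

def fill_A(A_inner, X):
    """Complete the free (nc-1)x(nc-1) block to a matrix in F'_A.
    Illustrative helper, not the reference implementation."""
    N, nc = X.shape
    A = np.zeros((nc, nc))
    A[1:, 1:] = A_inner
    # partition of unity for rows i >= 2: A(i,1) = -sum_{j>=2} A(i,j)
    A[1:, 0] = -A[1:, 1:].sum(axis=1)
    # positivity: A(1,j) = -min_l sum_{i>=2} x_i(l) A(i,j)
    A[0, :] = -np.min(X[:, 1:] @ A[1:, :], axis=0)
    # partition of unity for row 1: rescale so the first row sums to one
    return A / A[0, :].sum()

# Synthetic stand-in for a normalized eigenvector basis: constant first
# column, mean-free remaining columns.
rng = np.random.default_rng(0)
N, nc = 50, 3
X = np.hstack([np.ones((N, 1)), rng.standard_normal((N, nc - 1))])
X[:, 1:] -= X[:, 1:].mean(axis=0)

A = fill_A(rng.standard_normal((nc - 1, nc - 1)), X)
chi = X @ A
print(np.all(chi >= -1e-12))              # memberships are non-negative
print(np.allclose(chi.sum(axis=1), 1.0))  # and sum to one in every row
```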

Originally, the objective in Weber (2006), Deuflhard and Weber (2005) was to maximize metastability, which was defined by Deuflhard et al. (2000) as

trace(D_c^{−2} χ^T D² P χ),  (13)


where D_c² = diag(χ^T w). This corresponds to the sum of the self-transition probabilities of each conformation. The matrix

P̄ := D_c^{−2} χ^T D² P χ  (14)

is called coupling matrix and can be considered as a coarse grained transition probability matrix. In the case of crisp membership vectors χ ∈ {0, 1}^N, it describes transition probabilities between the clusters. Thus, the objective is to maximize the holding probabilities within the conformations.

In the case of soft membership vectors, however, this interpretation is no longer valid. The coupling matrix has to be defined differently, as will be done in the following.

4.3.1 A new objective function

To indicate that the transition probability matrix P represents a discrete-time Markov chain with some time-step τ, we temporarily introduce the notation P(τ). If P stems from molecular simulations, τ is the time-step at which the underlying continuous-time Markov process has been sampled. (For details, the reader is referred to Sect. 6.) In data classification problems, τ is a dimensionless time constant, τ = 1.

The transition probability matrix P can be used to propagate discrete densities x(t) = (x_1(t), . . . , x_N(t))^T with x_i(t) ≥ 0, ∑_i x_i(t) = 1 via

x(t + τ) := P^T(τ) x(t).

The densities can be restricted to the clusters via the membership functions χ,

x_c(t) := χ^T x(t).

This leads immediately to the question whether there exists a coarse grained propagator matrix Pc such that propagation and restriction commute in the sense that

χ^T P^T(τ) x(t) = Pc^T(τ) χ^T x(t).  (15)

In the following, we again omit the time-step τ. The question at hand has been examined in detail by Kube and Weber (2006, 2007). Indeed, under the condition that the columns of χ span an invariant subspace of P (which is satisfied for PCCA+ because χ is a linear transformation of eigenvectors), the coarse propagator matrix

Pc = (D_c^{−2} χ^T D² χ)^{−1} D_c^{−2} χ^T D² P χ = (χ^T D² χ)^{−1} χ^T D² P χ

satisfies (15). In fact, using the relationship X^T D² X = I, one can show that

Pc = A^{−1} Λ A,


whereby Λ is the diagonal matrix containing the nc largest eigenvalues as defined in Eq. (7). Thus, the Perron cluster eigenvalues are maintained, which preserves the time-scales of the Markov chain. This can be considered as the main advantage of PCCA+. Note that for crisp membership vectors χ_i ∈ {0, 1}^N the original definition (14) is recovered, i.e. Pc = P̄.

Optimization of trace(Pc) makes no sense because the trace only depends on Λ but not on A or χ, respectively. In addition, the matrix P̄ is always stochastic whereas Pc is not, because it can have negative entries. Thus the interpretation of Pc as a Markov chain transition matrix might fail. Let us define the matrix

S = S(A; X, w) := D_c^{−2} χ^T D² χ.

If S were the identity, then Pc = P̄ and Pc would be a row-stochastic matrix. This motivates the new objective to find a transformation matrix A such that S gets close to the identity matrix,

‖S − I_nc‖ → min.

This corresponds to a minimization of the off-diagonal entries in S, which means that the membership vectors must be as crisp as possible.

Since trace(S) ≤ nc = trace(I_nc), an appropriate objective function f_nc is given by

f_nc(A; X, w) := nc − trace(S) → min.  (16)

This is similar to the original objective function (13) with P replaced by the identity. In Weber (2006), Lemma 3.6, it is shown that in the case of normalized eigenvectors (Eq. 8) it holds

A(1, i) = (χ^T w)(i),

and consequently

trace(S) = ∑_{i=1}^{nc} (χ^T D² χ)(i, i) / (χ^T w)(i) = ∑_{i=1}^{nc} (A^T A)(i, i) / A(1, i) = ∑_{i=1}^{nc} ∑_{j=1}^{nc} A(j, i)² / A(1, i).

Thus the objective function can be expressed completely in terms of A, with the gradient given by

∇_{A(:, j)} trace(S(A)) = A(:, j)/A(1, j).
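Expressed in code, the objective (16) therefore depends on A alone (a sketch; the function name is ours, and A is assumed feasible with a positive first row):

```python
import numpy as np

def crispness_objective(A):
    """f_nc(A) = nc - trace(S), with trace(S) = sum_{i,j} A(j,i)^2 / A(1,i)."""
    nc = A.shape[0]
    return nc - np.sum(A**2 / A[0, :])

# A perfectly crisp 2x2 example (invented): the objective attains its minimum 0.
A_crisp = np.array([[0.5, 0.5],
                    [-0.5, 0.5]])
print(crispness_objective(A_crisp))     # 0.0

# A less crisp transformation gives a strictly positive objective value.
A_soft = np.array([[0.6, 0.4],
                   [-0.2, 0.2]])
print(crispness_objective(A_soft) > 0)  # True
```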

Remark 4 Different objective functions that force S towards the identity are possible. For example, the objective function

f_nc^det(A; X, w) = 1 − det(S) → min

can be related to holding times in metastable macro-states, see Weber (2006). A closely related approach, independently developed by Korenblum and Shalloway (2003), is based on uncertainty minimization via the objective function

f_nc^log(A; X, w) = −∑_i log(S(i, i)) → min,

for which a gradient-based optimization routine has been developed by White and Shalloway (2009). This objective function can be considered as maximizing the geometric mean of the diagonal elements of S, whereas our objective function (16) aims at maximizing the arithmetic mean.

4.3.2 Minimizing the objective function

The objective function (16) is convex. Therefore, the optimum is attained in a vertex of the feasible set F′_A (Weber 2006, Lemma 3.7). However, minimization of a convex function with linear constraints is not a trivial task. There are several possibilities to solve the minimization problem:

1. Given a (possibly infeasible) starting guess A_0 close to the optimal solution A*, one can expand the objective function into a 1st order Taylor series around A_0. Local minimization of this linear approximation with respect to the constraints (2a)–(2b) is known as linear programming and can be solved via simplex algorithms or interior point methods. They are part of the GNU linear programming kit,3 which is freely available for download. This approach was also used by White and Shalloway (2009). However, this procedure is time-consuming due to the large number of constraints. White and Shalloway (2009) propose to first identify the set of inequality constraints that are most likely to be active, and to increase this set iteratively until all inequality constraints are satisfied. This method, however, fails for the example in Sect. 5.1, and was not efficient in many of our applications.

2. Deuflhard and Weber (2005) suggest to perform unconstrained optimization on the (nc − 1)² elements of A(2 : nc, 2 : nc) and to transform the solution after each iteration step into a feasible solution by imposing conditions (3a)–(3b). Since this routine is non-differentiable, Deuflhard and Weber (2005) proposed to use the derivative-free Nelder–Mead algorithm (MATLAB's fminsearch) for optimization. The (usually infeasible) initial guess is generated by the inner simplex algorithm developed by Weber and Galliat (2002) (sub-algorithm A in Deuflhard and Weber (2005)). This iterative scheme with the objective function (16) is implemented in the current MATLAB version of PCCA+. Note that the algorithm will only find a local optimum close to the starting guess, not a guaranteed global one. However, as shown by Deuflhard and Weber (2005), the solution is unique if in addition to the properties (i)–(iii) it also satisfies the maximal scaling condition (10). Thus the maximal scaling condition can be considered as an indicator of uniqueness. For larger problems, however, the Nelder–Mead algorithm is still time-consuming.

3 http://www.gnu.org/software/glpk.


3. For non-reversible Markov chains, the number of clusters often cannot be determined from the spectral gap. Other criteria are available (see Sect. 4.3.3), but these require to solve the clustering problem, i.e. to determine χ, for a varying number of possible clusters, nc. Thus, the optimization procedure has to be performed several times and should be fast. We therefore suggest to use a method based on (numerical) derivatives, for example the Gauss–Newton method NLSCON.4 Even though this method often fails to converge in our case here due to the non-differentiability, it nevertheless delivers a solution close to a local minimum. These "nearly optimal" solutions can then be used to identify the correct number of clusters. For the selected number nc, the optimization can then be repeated with the Nelder–Mead algorithm.

4.3.3 Number of clusters

Since the number of clusters nc is unknown in advance, it is recommended to run the cluster algorithm several times with different input values for nc and to choose the "best" solution. In order to evaluate the quality of the solution, several criteria can be used:

1. The spectral gap. If there are nc well-separated clusters, there will be a significant gap between the eigenvalues λ_nc and λ_{nc+1}.

2. The condition of the invariant subspace X spanned by the first nc eigenvectors of P or L, respectively, see Stewart and Ji-guang (1990). In the case of nc well-separated clusters, the corresponding invariant subspace X is well-conditioned. In the case of a reversible Markov chain this criterion is equivalent to the spectral-gap criterion. For non-reversible Markov chains, X could be ill-conditioned even though the spectral gap is large, as demonstrated by Stewart and Ji-guang (1990). However, we rarely observe this case in our applications, where we usually deal with nearly reversible Markov chains.

3. The minChi-criterion of Weber et al. (2006). In general, the initial guess for A is infeasible, i.e. it leads to a membership matrix χ with negative entries. However, if there exist well separated clusters, the value

minChi = min_i min_j χ(i, j)

will be close to zero. Thus, one can decide for the number nc that maximizes the minChi-value.

4. Optimality of the solution. Since trace(S) ≤ nc, one could choose the number nc for which

(nc − f_nc)/nc = trace(S)/nc → max.  (17)

This value will be referred to as crispness.

4 http://www.zib.de/de/numerik/software/newtonlib.html.


In general, if there are nc well-separated clusters, all proposed criteria will favor this number.

Finally, the real-valued solution matrix χ can be re-transformed into a discrete indicator matrix by

o_i ∈ C_k if χ_k(i) = max_j χ_j(i)

in order to obtain a partition of data points into crisp clusters. The PCCA+ algorithm with cluster selection criterion (17) has been implemented in Matlab and is available from the authors upon request. The corresponding flowchart is illustrated in Fig. 3.

5 Data classification

In the first example, we consider a set of objects with given pairwise similarities,whereas in the second example only the objects are given.

5.1 A small example with transition states

The following example has been published by Bowman (2012). Though the author there claims that PCCA+ fails in this case, we demonstrate that the contrary is true: PCCA+ behaves exactly as expected and delivers feasible and interpretable solutions.

We consider 9 objects with adjacency matrix

W = ( 1000  100  100   10    0    0    0    0    0
       100 1000  100    0    0    0    0    0    0
       100  100 1000    0    μ    0    0    0    0
        10    0    0 1000  100  100   10    0    0
         0    0    μ  100 1000  100    0    0    0
         0    0    0  100  100 1000    0    μ    0
         0    0    0   10    0    0 1000  100  100
         0    0    0    0    0    μ  100 1000  100
         0    0    0    0    0    0  100  100 1000 )

with some perturbation μ ≥ 0. The row-stochastic matrix P is computed by P = D^{−1} W, where D contains the row-sums of W on its diagonal. The matrix P is used as input for the PCCA+ algorithm. Minimization of the objective function (16) has been performed with MATLAB's fminsearch (Nelder–Mead algorithm).
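For reference, the setup of this example is easy to reproduce; the following Python/NumPy sketch builds W and P for μ = 0 and shows the Perron cluster of three eigenvalues close to one, followed by a clear spectral gap:

```python
import numpy as np

mu = 0.0     # no perturbation
W = np.array([
    [1000, 100, 100,  10,   0,   0,    0,   0,    0],
    [ 100,1000, 100,   0,   0,   0,    0,   0,    0],
    [ 100, 100,1000,   0,  mu,   0,    0,   0,    0],
    [  10,   0,   0,1000, 100, 100,   10,   0,    0],
    [   0,   0,  mu, 100,1000, 100,    0,   0,    0],
    [   0,   0,   0, 100, 100,1000,    0,  mu,    0],
    [   0,   0,   0,  10,   0,   0, 1000, 100,  100],
    [   0,   0,   0,   0,   0,  mu,  100,1000,  100],
    [   0,   0,   0,   0,   0,   0,  100, 100, 1000]], dtype=float)

P = W / W.sum(axis=1, keepdims=True)   # P = D^{-1} W, row-stochastic
lam = np.sort(np.linalg.eigvals(P).real)[::-1]
print(lam[:4])   # three eigenvalues near 1, then a clear gap
```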

Figure 4 shows how the second cluster is blurred with increasing perturbation μ, if we aim at a decomposition into 3 clusters. However, if we use crispness, i.e. the value of the selection criterion (17), as indicator for the number of clusters, then nc = 3 is no longer optimal for large perturbations. The following table lists the optimal cluster numbers obtained by this criterion:

μ        0    10    50    100    200    500    1000
nc^opt   3     3     3      3      2      2       5

Fig. 3 Flowchart of the algorithm PCCA+ as implemented in the Matlab-file pcca.m

Fig. 4 Change of membership values with varying values of the perturbation μ for a decomposition into 3 clusters. The larger μ, the more the second cluster gets blurred, and the crispness decreases. a μ = 0, b μ = 100, c μ = 1000, d crispness

In fact, for μ = 1000 and nc = 5, the crispness trace(S)/nc amounts to 0.804, whereas for nc = 3, it only takes the value 0.632. The solution for nc = 5 is illustrated in Fig. 5.

5.2 Choice of distance function

We now apply PCCA+ to a point set in R² consisting of three nested rings, as illustrated in Figs. 6 and 7. Here, only the coordinates of the points are given, but no similarities or distances are defined.

Pairwise similarities s_ij can be computed from distances by the Gaussian similarity function (1). However, the results of spectral clustering are quite sensitive to the choice


Fig. 5 Membership values for a decomposition into 5 clusters for μ = 1000. This solution is much closer to a crisp clustering than the solution in Fig. 4c

Fig. 6 Example for a point set where the Euclidean distance is inappropriate to identify clusters. The separation between the clusters is quite weak, which is illustrated by the small membership values of the data points. a Clusters, b membership values

of the distance function d_ij = d(o_i, o_j). Although the Euclidean distance

d_ij = ‖o_i − o_j‖₂

is a natural choice, it is not always appropriate, as illustrated in Fig. 6. Clustering based on the Euclidean distance assumes that the data points are grouped into "compact" clusters where the objects within one cluster are either mutually similar to each other or they are similar with respect to a common representative or centroid. Whenever data subsets occupy elongated regions like spiral arms or circles, an alternative distance function based on connectedness of data points is required. Such a distance function has been introduced by Fischer and Buhmann (2002), Fischer et al. (2001) in conjunction with path based clustering. Path based clustering assigns two objects to the same cluster if they are connected by a path with high similarity between adjacent objects on the path. The effective distance between two objects is calculated as the minimum over all path distances,

d_ij^eff = min_{p ∈ P_ij(E)} { max_{1≤k<|p|} d_{p[k] p[k+1]} }.


Fig. 7 Example for a point set where the use of effective distances results in the desired clustering. The separation between clusters is strong, illustrated by the fact that the membership vectors are nearly indicator vectors. a Clusters, b membership values

Here, E are the edges of the fully connected graph, P_ij(E) denotes the set of all paths from object i to object j, and d_{p[k] p[k+1]} denotes the Euclidean distance between objects p[k] and p[k+1] on the path. According to Fischer and Buhmann (2002), the matrix D_eff can be computed recursively by a variant of Kruskal's minimum spanning tree algorithm. The corresponding similarity matrix S is computed by the Gaussian function (1) with β = 10. Figure 7 shows that PCCA+ based on the effective distance matrix results in the expected clustering.
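For small data sets, the minimax path distances can also be computed with a Floyd–Warshall-type recursion instead of the cited Kruskal variant (an O(N³) Python/NumPy sketch of our own, shown here only to make the definition of d_eff concrete):

```python
import numpy as np

def effective_distances(D):
    """Minimax path distances: Deff[i,j] is the smallest possible value of
    the largest single step over all paths from i to j (Floyd-Warshall
    variant of the path-based distance)."""
    Deff = D.astype(float).copy()
    for k in range(len(Deff)):
        # best path via k: the bottleneck is the larger of the two legs
        Deff = np.minimum(Deff, np.maximum(Deff[:, k:k+1], Deff[k:k+1, :]))
    return Deff

# Three collinear points: the direct step 0 -> 2 has length 2, but the
# chain 0 -> 1 -> 2 never requires a step longer than 1.
pts = np.array([0.0, 1.0, 2.0])
D = np.abs(pts[:, None] - pts[None, :])
print(effective_distances(D)[0, 2])   # 1.0
```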

6 Application to Markov state models

In practice, molecular simulations are carried out by solving the Hamiltonian equations of motion over a long period of time, and analyzing the resulting trajectory in position space to derive statistical information such as the mean free energy. Many molecules, however, exhibit a metastable behavior. That means there exist a few nearly stable geometric configurations in which the molecule oscillates for a long period of time before it rapidly switches to another configuration. Dellnitz and Junge (1999), Schütte (1999), Deuflhard (2003) introduced the name metastable conformations for sets of nearly stable configurations. Since jumps between conformations are rare events, a single trajectory must be followed over an extremely long period of time to obtain statistically reliable results. This approach is computationally expensive.

As an example, a typical trajectory of the butane molecule, projected onto the C-C-C-C torsion angle, is illustrated in Fig. 8. The y-axis (representing the position space) may be decomposed into three intervals which correspond to three different metastable conformations (called rotamers): two gauche conformers, and an anti conformer, where the four carbon centers are coplanar. The three domains with dihedral angles of 0°, 120° and 240° are not considered to be conformations, but are instead transition states.

The identification of metastable conformations together with their life times and transition patterns is essential for the analysis of a molecule's long term behavior. This is achieved by first decomposing the infinite dimensional molecular position space into a finite number of sets. Then the dynamics is described as a Markov chain on this finite dimensional state space. Finally, metastable conformations are identified as clusters from the transition probability matrix via PCCA+. The approach is computationally feasible, since it can be parallelized and only requires short-term molecular dynamics trajectories. This coarse-graining procedure leads to a reduced dynamical description, which was originally developed under the name of conformation dynamics by Dellnitz and Junge (1999), Schütte (1999), Deuflhard et al. (2000), Deuflhard (2003). Today it is often referred to as Markov state modelling (MSM), see Chodera et al. (2007), Bowman et al. (2009), Sarich et al. (2010), Prinz et al. (2011).

Fig. 8 The long-term simulation of butane clearly shows three metastable conformations of the molecule

6.1 Statistical mechanics

In a canonical ensemble (constant number of particles, constant volume, and constant temperature) the state of a bio-molecule is described by a statistical ensemble in a phase space Γ. For x = (q, p) ∈ Γ = Ω × R^d, the positions q and momenta p of each atom in the molecule are distributed according to the Boltzmann distribution

π(q, p) ∝ exp(−β H(q, p)).  (18)

Here β = 1/(k_B T) is the inverse temperature, with temperature T and Boltzmann constant k_B, and H denotes the Hamiltonian function, which is given by H(q, p) = V(q) + K(p), where V(q) is the potential and K(p) the kinetic energy. This canonical density can be split into a distribution of momenta η(p) and positions π(q), where

π(q) ∝ exp(−β V(q)) and η(p) ∝ exp(−β K(p)).

Let us consider the Hamiltonian dynamics, which is given by

q̇ = p,  ṗ = −∇V(q),  (19)


where ∇V(q) is the gradient of an energy function (the potential) V : Ω → R. Equation (19) is the starting point for a trajectory based description of this system in a micro-canonical ensemble (constant number of particles, constant volume, and constant energy). In contrast, we now consider a system which is embedded in a heat bath with constant temperature T (canonical ensemble). The flow Φ^τ corresponding to (19) for a time span τ > 0 is given formally by

(q(τ), p(τ)) = Φ^τ(q(0), p(0)).

Let Π_q be the projection of the state (q, p) onto the position q and let further p be chosen randomly according to the distribution η(p); then

q_{i+1} = Π_q Φ^τ(q_i, p_i)

describes a Markov process with the Boltzmann distribution as stationary distribution. The i-th state depends on the preceding state only.

The corresponding Liouville operator is time independent, as shown by Schütte (1999). By projecting this Liouville operator onto the position space, the behavior of the system can be described by a transition function, which is defined by Schütte (1999) as

p(τ, f, h) = ∫_Ω T^τ f(q) h(q) π(q) dq,  (20)

where

T^τ f(q) = ∫_{R^d} f(Π_q Φ^τ(q, p)) η(p) dp.  (21)

This construction offers many advantages for the analysis of molecular processes. The fundamental idea behind this formulation is that the transfer operator T^τ in (21) is a linear operator although the ordinary differential equation (19) is (extremely) non-linear. This linearization allows for a Galerkin discretization of T^τ and thus for a numerical approximation of eigenfunctions and eigenvalues of the discrete spectrum of T^τ.

We take advantage of the fact that the behavior of molecules can be well described by their structurally related configurations (metastable conformations). Mathematically speaking, a metastable conformation is a function C : Ω → [0, 1] which is nearly invariant under the transfer operator T^τ, i.e.

T^τ C(q) ≈ C(q).  (22)

The identification of metastable conformations is now identical to finding a set of functions {C_1(q), . . . , C_nc(q)} such that


1. ∑_{J=1}^{nc} C_J(q) = 1 ∀ q ∈ Ω (partition of unity)
2. C_J(q) ≥ 0 ∀ q ∈ Ω, J = 1, . . . , nc (positivity)
3. T^τ C_J(q) ≈ C_J(q), J = 1, . . . , nc (invariance)

Here and in the forthcoming, we use capital indices I, J, . . . for the numbering of the metastable conformations. In the following we show how the metastable conformations can be computed by discretization of the position space and application of PCCA+.

6.2 Discretization

In order to identify the conformations we need a discretization of the position space Ω. For this purpose, we define a decomposition of Ω into N disjoint sets {Ω_i}_{i=1}^N with

(i) Ω_i is measurable and |Ω_i| > 0 for i = 1, . . . , N,
(ii) |Ω_i ∩ Ω_j| = 0 if i ≠ j,
(iii) ⋃_i Ω_i = Ω.

As already mentioned, the position space is high dimensional, which prohibits any usage of mesh-based methods like finite elements. Thus we take advantage of mesh-free methods; more precisely, we consider a Voronoi tessellation. We define a set of basis functions {ϕ_i}_{i=1}^N as characteristic functions of the sets Ω_i:

ϕ_i(q) = 1_{Ω_i}(q) := 1 if q ∈ Ω_i, 0 otherwise.

The functions ϕ_1, . . . , ϕ_N satisfy

(a) ∑_{i=1}^N ϕ_i(q) = 1 ∀ q ∈ Ω (partition of unity),
(b) ϕ_i(q) ≥ 0 ∀ q ∈ Ω, i = 1, . . . , N (positivity).

In terms of the characteristic basis functions, the transition function p(τ, ϕ_i, ϕ_j) describes the transition probability between the two sets Ω_i and Ω_j. In other words, it describes the ratio of trajectories starting in Ω_i with Boltzmann distributed momenta p ∈ R^d and ending in Ω_j after time-span τ > 0.

Having now a discretization of Ω, we search for the unknown conformations {C_J(q)}_{J=1}^{nc} as linear combinations of the basis functions {ϕ_i}_{i=1}^N. More precisely,

C_J(q) = ∑_{i=1}^N χ_J(i) ϕ_i(q),  J = 1, . . . , nc.  (23)

If the χ_J(i) satisfy the discrete conditions

1. ∑_{J=1}^{nc} χ_J(i) = 1, i = 1, . . . , N (partition of unity)
2. χ_J(i) ≥ 0, i = 1, . . . , N, J = 1, . . . , nc (positivity)

then the C_J(q) fulfill the conditions 1. and 2. from above. Thus χ_J(i) can be interpreted as assigning set Ω_i to conformation C_J with probability χ_J(i).


In order to enforce the invariance condition, we employ a Galerkin discretization by defining an inner product for square integrable functions f, g : Ω → R, f, g ∈ L²_π(Ω),

⟨g, f⟩_π = ∫_Ω f(q) g(q) π(q) dq,

and inserting (23) into (22), such that

⟨T^τ (∑_{i=1}^N χ_J(i) ϕ_i), ϕ_j⟩_π ≈ ⟨∑_{i=1}^N χ_J(i) ϕ_i, ϕ_j⟩_π,

∑_{i=1}^N χ_J(i) ⟨T^τ ϕ_i, ϕ_j⟩_π ≈ χ_J(j) ⟨ϕ_j, ϕ_j⟩_π,  (24)

where the right hand side reduces to a single term because the characteristic basis functions are mutually orthogonal with respect to ⟨·, ·⟩_π.

Dividing both sides of (24) by ⟨ϕ_j, ϕ_j⟩_π and defining

P_ji(τ) := ⟨ϕ_j, T^τ ϕ_i⟩_π / ⟨ϕ_j, ϕ_j⟩_π = ∫_Ω T^τ ϕ_i(q) ϕ_j(q) π(q) dq / ∫_Ω ϕ_j(q) π(q) dq,

we obtain for the coefficient vectors χ_J

P(τ) χ_J ≈ χ_J,  (25)

where χ_J = (χ_J(1), χ_J(2), . . . , χ_J(N))^T. The stochastic matrix P(τ) describes the transition probabilities between the sets Ω_i.
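In practice, the entries of P(τ) are estimated from transitions between the sets Ω_i observed in (many short) trajectories. A deliberately simplified sketch (ignoring the sampling and weighting issues discussed in Remark 5; names and data are ours):

```python
import numpy as np

def transition_matrix(dtraj, n_states, lag=1):
    """Row-normalized count estimate of P(tau) from a discrete trajectory
    of set indices; tau corresponds to `lag` trajectory steps.
    (Every state must be visited here, otherwise a row has no counts.)"""
    C = np.zeros((n_states, n_states))
    for a, b in zip(dtraj[:-lag], dtraj[lag:]):
        C[a, b] += 1.0
    return C / C.sum(axis=1, keepdims=True)

dtraj = [0, 0, 1, 1, 1, 0, 2, 2, 2, 2]   # invented discrete trajectory
P = transition_matrix(dtraj, 3)
print(np.allclose(P.sum(axis=1), 1.0))   # row-stochastic, as required
```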

Thus, the membership vectors {χ_J}_{J=1}^{nc} are exactly the result of PCCA+ applied to the matrix P(τ). They are obtained as a linear transformation of the eigenvectors X of P(τ) corresponding to eigenvalues close to one, subject to the constraints 1. (partition of unity) and 2. (positivity).

Remark 5 The computation of the above integrals is a challenging task, since the underlying state space Ω is high-dimensional. To overcome this problem, we employ strategies from particle methods. In detail, we apply Markov chain Monte Carlo methods as described by Frenkel and Smit (2002) in each Voronoi cell to generate a local Boltzmann distribution π_i(q) for each of the basis functions {φ_i}_{i=1}^N, i.e.

$$\pi_i(q) = \frac{\varphi_i(q)\,\pi(q)}{\int_\Omega \varphi_i(q)\,\pi(q)\,dq}.$$

The sampled positions q are propagated by molecular dynamics according to Φ^τ with randomized initial momenta. With these data we compute the entries of P(τ). This, of course, introduces a sampling error, which has to be controlled carefully. For details, the reader is referred to Weber (2006), Röblitz (2008).
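The row-wise sampling scheme from Remark 5 can be sketched as follows. Here `toy_propagate` is a purely illustrative stand-in for propagating a configuration by molecular dynamics and reporting the Voronoi cell it ends up in; it is not the actual flow Φ^τ:

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_row(i, n_cells, n_samples, propagate):
    """Estimate row i of P(tau) from n_samples short trajectories
    started in Voronoi cell i; propagate(i) returns the index of
    the cell the trajectory ends in."""
    counts = np.zeros(n_cells)
    for _ in range(n_samples):
        counts[propagate(i)] += 1
    return counts / n_samples

# Illustrative dynamics: stay in cell i with probability 0.9,
# otherwise jump to a uniformly chosen cell.
def toy_propagate(i, n_cells=4):
    if rng.random() < 0.9:
        return i
    return rng.integers(n_cells)

row = estimate_row(2, 4, 10000, toy_propagate)
assert np.isclose(row.sum(), 1.0)   # each estimated row is stochastic
```

As the remark notes, the finite number of samples per row introduces a statistical error in each entry, which is why the sampling has to be controlled carefully.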


6.3 Properties of transition matrices

As explained in this section, the states of the Markov chain stem from a discretization of some molecular potential energy surface. Regions in configuration space that separate different metastable conformations can be energetic or entropic barriers. Energy barriers can often be identified with saddle points of the potential energy surface, whereas entropic barriers are characterized by narrow valleys connecting different metastable regions. Particles located in transition regions will rapidly move to one of the nearby conformations, whereas particles in cluster states will tend to stay in the states belonging to that specific conformation and rarely switch to a different metastable region. Roughly described, the discretization gives rise to two different kinds of states: cluster states, which are located near the center of a metastable conformation, and transition states, which are located near energetic or entropic barriers. In fact, the transition probability matrices P(τ) represent nearly uncoupled Markov chains with transition states.

Due to finite sampling, the matrices resulting from the above described method are not reversible. The reason is that the matrices are computed row-wise and that trajectories initiated in different basis functions have different statistical weights. Thus, a simple symmetrization by evaluating trajectories in both directions is impossible. Nevertheless, as long as the number of Monte Carlo sampling points in the computation of matrix entries is large enough, the resulting Markov chain will be nearly reversible, i.e. ‖D²P − Pᵀ D²‖ will be small. Since PCCA+ works for both reversible and irreversible Markov chains, this does not influence the performance of the algorithm.
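The deviation from reversibility ‖D²P − PᵀD²‖, with D² = diag(π) built from the stationary distribution π, can be evaluated directly. A sketch, assuming P has a unique stationary distribution (both example matrices are illustrative):

```python
import numpy as np

def reversibility_defect(P):
    """Frobenius norm of D^2 P - P^T D^2, where D^2 = diag(pi) and pi
    is the stationary distribution of the row-stochastic matrix P.
    For a reversible chain (detailed balance), the norm vanishes."""
    # Stationary distribution: left eigenvector of P for eigenvalue 1.
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi = pi / pi.sum()
    D2 = np.diag(pi)
    return np.linalg.norm(D2 @ P - P.T @ D2)

# A reversible example: symmetric stochastic matrix (pi is uniform).
P_rev = np.array([[0.8, 0.2], [0.2, 0.8]])
print(reversibility_defect(P_rev))   # (numerically) zero

# A non-reversible example: cyclic drift breaks detailed balance.
P_cyc = np.array([[0.1, 0.8, 0.1],
                  [0.1, 0.1, 0.8],
                  [0.8, 0.1, 0.1]])
print(reversibility_defect(P_cyc))   # clearly positive
```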

6.4 Relation to diffusion maps

As demonstrated so far, in conformation dynamics or Markov state modelling, respectively, the data objects to be clustered are functions defined on some specific space Ω, and the similarity matrix represents a discretization of a continuum transfer operator that describes how these functions evolve in time with respect to some underlying stochastic dynamical system. Thus, PCCA+ relates the spectral properties of the discretized transfer operator to probability distributions in Ω and the long-time behavior of the complex dynamical system. This approach has some similarities with the concept of diffusion maps by Belkin and Niyogi (2003), Nadler et al. (2005), Nadler et al. (2006), Coifman and Lafon (2006), where the situation is just the other way round: the underlying dynamical process is not known a priori but has to be reconstructed from a finite dataset. In particular, based on a large dataset a family of random walk processes based on diffusion kernels is constructed, and the spectral properties of these diffusion processes are related to the geometry and probability distribution of the dataset.

6.5 Example

We demonstrate the application of our algorithm to the model system alanine dipeptide (Fig. 9) in vacuum with the MMFF force field described in Halgren and Nachbar (1996). For the discretization, we chose N = 504 molecular configurations from a high-temperature (1,000 Kelvin) molecular dynamics trajectory as defining nodes


Fig. 9 Spatial structure of alanine dipeptide

Fig. 10 a Image of the 504 × 504 transition probability matrix P (for ease of visualization, we plotted the element-wise logarithm log(P) instead of P). b The first 10 eigenvalues of P. The first 5 eigenvalues form a cluster that is clearly separated from the rest of the spectrum. c The value of the cluster selection criterion (17) for different numbers of clusters. The maximum value is achieved for nc = 5 clusters. d Image of log(P) with rows and columns resorted according to the decomposition into 5 clusters

of our Voronoi basis functions {φ_i}_{i=1}^N. As distance measure, we use the Euclidean distance in the space spanned by the four backbone torsion angles ω_1, …, ω_4. We thus ignore variability in other degrees of freedom, which is justified by the fact that the torsion angles are the slow degrees of freedom we are interested in. Within every basis function, a Markov chain Monte Carlo method was applied to generate configurations distributed according to the partial densities π_i(q). These configurations were propagated according to the flow Φ^τ with τ = 39 femtoseconds. With these data we computed the entries of the transition probability matrix P(τ), which are visualized in Fig. 10a. P(τ) has a cluster of 5 eigenvalues close to one, see Fig. 10b, and the value of the selection criterion (17) is also maximal for nc = 5 (Fig. 10c). Thus we computed the membership matrix χ for nc = 5 clusters. We also calculated the relaxation of χ towards a hard decomposition χ̄, i.e.

$$\bar{\chi}_J(i) = \begin{cases} 1, & \text{if } J = \arg\max_j \chi_j(i), \\ 0, & \text{else.} \end{cases}$$

This hard decomposition can be used to visualize the metastable conformations. Figure 11 shows the nodes of the basis functions colored according to this decomposition. Finally, the membership vectors can be used to compute statistical information from the molecular dynamics trajectories, for example the mean free energy per conformation. For details the reader is referred to Weber (2006).

Fig. 11 Plot of the defining nodes of the Voronoi basis functions. Bold markers indicate that the basis function belongs to the cluster with probability larger than 0.8, whereas the other markers indicate memberships smaller than 0.8. Left: torsion angles 2 and 3. Right: torsion angles 3 and 4. It can be seen that the clusters identified by PCCA+ are isolated in at least one slow degree of freedom
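Computing the hard decomposition χ̄ from a fuzzy membership matrix amounts to a per-row argmax. A minimal sketch (the membership matrix `chi` is made up for illustration):

```python
import numpy as np

def harden(chi):
    """Hard decomposition: assign each state (row) to the cluster
    (column) of maximal membership, as in the definition of chi_bar."""
    hard = np.zeros_like(chi)
    hard[np.arange(chi.shape[0]), np.argmax(chi, axis=1)] = 1.0
    return hard

chi = np.array([[0.7, 0.2, 0.1],
                [0.1, 0.3, 0.6],
                [0.4, 0.5, 0.1]])
# Rows collapse to unit vectors: [[1,0,0], [0,0,1], [0,1,0]].
print(harden(chi))
```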

7 Summary

In this paper, we demonstrated how PCCA+ delivers a fuzzy clustering in terms of membership vectors as a linear transformation of eigenvectors of some transition probability matrix representing a Markov chain on the objects to be clustered. In particular, we have shown that such a transformation always exists for a special class of transition matrices representing uncoupled Markov chains with transient states. These Markov chains do not need to be reversible, and the transition matrices can be far from an ideal block structure. In the general case, which can be considered as a small perturbation of the ideal case, the transformation is no longer unique, but PCCA+ delivers a fuzzy clustering that satisfies an optimality criterion. In particular, PCCA+ delivers membership vectors that are as "crisp" as possible, such that these vectors can be used for a coarse graining of the Markov chain which preserves the slow time scales. We consider this to be the main advantage of PCCA+ compared to other (fuzzy) clustering methods. In addition, we discussed how the optimality criterion can be used to select the "best" number of clusters.


References

Bapat RB, Raghavan TES (1997) Nonnegative matrices and applications. Cambridge University Press, Cambridge
Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data representation. Neural Comput 15:1373–1396
Bezdek JC, Ehrlich R, Full W (1984) The fuzzy c-means clustering algorithm. Comput Geosci 10(2–3):191–203
Bowman GR (2012) Coarse-grained Markov chains capture molecular thermodynamics and kinetics in no uncertain terms. arxiv.org/abs/1201.3867
Bowman GR, Beauchamp KA, Boxer G, Pande VS (2009) Progress and challenges in the automated construction of Markov state models for full protein systems. J Chem Phys 131(12):124101
Brémaud P (1999) Markov chains: Gibbs fields, Monte Carlo simulation, and queues. Number 31 in Texts in Applied Mathematics. Springer, New York
Chodera JD, Singhal N, Swope WC, Pande VS, Dill KA (2007) Automatic discovery of metastable states for the construction of Markov models of macromolecular conformational dynamics. J Chem Phys 126:155101

Coifman RR, Lafon S (2006) Diffusion maps. Appl Comput Harmon Anal 21:5–30
Courtois PJ (1977) Decomposability: queueing and computer system applications. Academic Press, Orlando
Dellnitz M, Junge O (1999) On the approximation of complicated dynamical behavior. SIAM J Numer Anal 36(2):491–515
Deuflhard P (2003) From molecular dynamics to conformational dynamics in drug design. In: Kirkilionis M, Krömker S, Rannacher R, Tomi F (eds) Trends in nonlinear analysis. Springer, Berlin, pp 269–287
Deuflhard P, Weber M (2005) Robust Perron cluster analysis in conformation dynamics. Linear Algebra Appl 398:161–184
Deuflhard P, Huisinga W, Fischer A, Schütte Ch (2000) Identification of almost invariant aggregates in reversible nearly uncoupled Markov chains. Linear Algebra Appl 315:39–59
Fackeldey K, Bujotzek A, Weber M (2013) A meshless discretization method for Markov state models applied to explicit water peptide folding simulations. In: Griebel M, Schweitzer MA (eds) Meshfree methods for partial differential equations VI, volume 89 of Lecture Notes in Computational Science and Engineering. Springer, Berlin, pp 141–154

Fischer B, Buhmann JM (2002) Data resampling for path based clustering. In: Proceedings of the 24th DAGM symposium on pattern recognition, volume 2449 of Lecture Notes in Computer Science. Springer, London, pp 206–214
Fischer I, Poland J (2005) Amplifying the block matrix structure for spectral clustering. In: van Otterlo M, Poel M, Nijholt A (eds) Proceedings of the 14th annual machine learning conference of Belgium and the Netherlands, pp 21–28
Fischer B, Zöller T, Buhmann J (2001) Path based pairwise data clustering with application to texture segmentation. In: Energy minimization methods in computer vision and pattern recognition, volume 2134 of Lecture Notes in Computer Science. Springer, Berlin, pp 235–250
Frenkel D, Smit B (2002) Understanding molecular simulation: from algorithms to applications, volume 1 of Computational Science Series. Academic Press, London

Halgren T, Nachbar B (1996) Merck molecular force field. IV. Conformational energies and geometries for MMFF94. J Comput Chem 17(5–6):587–615
Jimenez R (2008) Fuzzy spectral clustering for identification of rock discontinuity sets. Rock Mech Rock Eng 41:929–939

Kannan R, Vempala S, Vetta A (2004) On clusterings: good, bad and spectral. J ACM 51:497–515
Kato T (1984) Perturbation theory for linear operators. Springer, Berlin
Kijima M (1997) Markov processes for stochastic modeling. Chapman and Hall, Stochastic Modeling Series
Korenblum D, Shalloway D (2003) Macrostate data clustering. Phys Rev E 67:056704
Kube S, Deuflhard P (2006) Errata on "Robust Perron Cluster Analysis in Conformation Dynamics". December. http://www.zib.de/susanna.roeblitz
Kube S, Weber M (2006) Coarse grained molecular kinetics. ZIB-Report 06-35, Zuse Institute Berlin
Kube S, Weber M (2007) A coarse graining method for the identification of transition rates between molecular conformations. J Chem Phys 126(2)


Lehoucq RB, Sorensen DC (1996) Deflation techniques for an implicitly restarted Arnoldi iteration. SIAM J Matrix Anal Appl 17(4):789–821
Metzner P, Weber M, Schütte C (2010) Observation uncertainty in reversible Markov chains. Phys Rev E Stat Nonlin Soft Matter Phys 82:031114
Meyer CD (1989) Stochastic complementation, uncoupling Markov chains, and the theory of nearly reducible systems. SIAM Rev 31(2):240–272
Nadler B, Lafon S, Coifman RR, Kevrekidis IG (2005) Diffusion maps, spectral clustering and eigenfunctions of Fokker-Planck operators. In: Advances in neural information processing systems, vol 18. MIT Press, Cambridge, pp 955–962
Nadler B, Lafon S, Coifman RR, Kevrekidis IG (2006) Diffusion maps, spectral clustering and reaction coordinates of dynamical systems. Appl Comput Harmon Anal 21(1):113–127
Ng A, Jordan M, Weiss Y (2002) On spectral clustering: analysis and an algorithm. In: Dietterich TG, Becker S, Ghahramani Z (eds) Advances in neural information processing systems, vol 14. MIT Press, Cambridge, pp 849–856
Prinz J-H, Wu H, Sarich M, Keller B, Fischbach M, Held M, Chodera JD, Schütte Ch, Noé F (2011) Markov models of molecular kinetics: generation and validation. J Chem Phys 134:174105
Röblitz S (2008) Statistical error estimation and grid-free hierarchical refinement in conformation dynamics. Doctoral thesis, Department of Mathematics and Computer Science, Freie Universität Berlin. http://www.diss.fu-berlin.de/diss/receive/FUDISS_thesis_000000008079
Röblitz S, Weber M (2009) Fuzzy spectral clustering by PCCA+. In: Mucha H-J, Ritter G (eds) Classification and clustering: models, software and applications, number 26 in WIAS Report. WIAS Berlin, pp 73–79
Sarich M, Noé F, Schütte Ch (2010) On the approximation quality of Markov state models. Multiscale Model Simul 8(4):1154–1177
Schütte Ch (1999) Conformational dynamics: modelling, theory, algorithms, and application to biomolecules. Habilitation thesis, Department of Mathematics and Computer Science, Freie Universität Berlin
Shi J, Malik J (2000) Normalized cuts and image segmentation. IEEE Trans Pattern Anal Mach Intell 22(8):888–905
Sleijpen GLG, van der Vorst HA (1996) A Jacobi-Davidson iteration method for linear eigenvalue problems. SIAM J Matrix Anal Appl 17(2):401–425
Stewart GW (1984) On the structure of nearly uncoupled Markov chains. In: Iazeolla G, Courtois PJ, Hordijk A (eds) Mathematical computer performance and reliability. Elsevier, New York, pp 287–302
Stewart GW, Sun J-G (1990) Matrix perturbation theory. Computer Science and Scientific Computing. Academic Press, Boston

von Luxburg U (2007) A tutorial on spectral clustering. Stat Comput 17:395–416
Weber M (2003) Improved Perron cluster analysis. ZIB-Report 03-04, Zuse Institute Berlin (ZIB)
Weber M (2006) Meshless methods in conformation dynamics. Doctoral thesis, Department of Mathematics and Computer Science, Freie Universität Berlin. Verlag Dr. Hut, München
Weber M (2013) Adaptive spectral clustering in molecular simulations. In: Giusti A, Ritter G, Vichi M (eds) Classification and data mining. Springer, Berlin, pp 147–154
Weber M, Galliat T (2002) Characterization of transition states in conformational dynamics using fuzzy sets. ZIB-Report 02-12, Zuse Institute Berlin
Weber M, Rungsarityotin W, Schliep A (2006) An indicator for the number of clusters using a linear map to simplex structure. In: Spiliopoulou M, Kruse R, Borgelt C, Nürnberger A, Gaul W (eds) From data and information analysis to knowledge engineering, Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, pp 103–110
White B, Shalloway D (2009) Efficient uncertainty minimization for fuzzy spectral clustering. Phys Rev E 80:056704
Zhao F, Liu H, Jiao L (2011) Spectral clustering with fuzzy similarity measure. Digit Signal Process 21:701–709
