
    CLUSTERING DATASETS WITH SINGULAR VALUE

    DECOMPOSITION

    A thesis submitted in partial fulfillment of the requirements for the

    degree

    MASTER OF SCIENCE

    in

    MATHEMATICS

    by

    EMMELINE P. DOUGLAS

    NOVEMBER 2008

    at

    THE GRADUATE SCHOOL OF THE COLLEGE OF CHARLESTON

    Approved by:

    Dr. Amy Langville, Thesis Advisor

    Dr. Ben Cox

    Dr. Katherine Johnston-Thom

    Dr. Martin Jones

    Dr. Amy T. McCandless, Dean of the Graduate School:


    2009

Copyright 2009 by Emmeline P. Douglas.

All rights reserved.


    ABSTRACT

    CLUSTERING DATASETS WITH SINGULAR VALUE

    DECOMPOSITION

    A thesis submitted in partial fulfillment of the requirements for the

    degree

    MASTER OF SCIENCE

    in

    MATHEMATICS

    by

    EMMELINE P. DOUGLAS

    NOVEMBER 2008

    at

    THE GRADUATE SCHOOL OF THE COLLEGE OF CHARLESTON

    Spectral graph partitioning has been widely acknowledged as a useful way to cluster

    matrices. Since eigen decompositions do not exist for rectangular matrices, it is

    necessary to find an alternative method for clustering rectangular datasets. The

    Singular Value Decomposition lends itself to two convenient and effective clustering

    techniques, one using the signs of singular vectors and the other using gaps in singular

    vectors. We can measure and compare the quality of our resultant clusters using an

entropy measure. When we are unable to decide which is better, the results can be nicely aggregated.


Contents

1 Introduction

2 The Fiedler Method
  2.1 Background
    2.1.1 Clustering with the Fiedler vector
    2.1.2 Limitations
  2.2 Extended Fiedler Method
    2.2.1 Clustering with Multiple Eigenvectors
    2.2.2 Limitations

3 Moving from Eigenvectors to Singular Vectors
  3.1 Small Example Datasets
  3.2 How to Cluster a matrix with SVD Signs
    3.2.1 Results on Small Yahoo! Dataset
    3.2.2 Why the SVD Signs method works
    3.2.3 Limitations of SVD signs
  3.3 SVD Gaps Method
    3.3.1 When are Gaps Large Enough?
    3.3.2 Results on Small Yahoo! dataset

4 Quality of Clusters
  4.1 Entropy Measure
  4.2 Comparing results from Small Yahoo! Example

5 Cluster Aggregation
  5.1 Results on Small Yahoo! Dataset

6 Experiments on Large datasets
  6.1 Yahoo!
  6.2 Wikipedia
  6.3 Netflix

7 Conclusion

A MATLAB Code
  A.1 SVD Signs
  A.2 SVD Gaps
  A.3 Entropy Measure
  A.4 Cluster Aggregation
  A.5 Other Code Used


    Acknowledgements

    I owe a great debt to the many people who helped me to successfully complete this

    thesis. First and most of all, to my thesis advisor Dr. Amy Langville who not only

    listened to countless research updates, presentations, and MATLAB complaints, but

    who encouraged me and gave me the confidence to start a thesis in the first place.

Second, to Kathryn Pedings, who was always a ready and willing sounding board whenever

    I was excited or frustrated and needed someone to talk to. Also I would like to thank

    my committee members for their interest, criticism, and encouragement. Last, but

not least, I would like to thank my fiancé, Andrew Aghapour, for being patient with

    me and my mood swings as I finished this thesis.


    Chapter 1

    Introduction

    Since the introduction of the internet to the public in the late 1980s, life has become

    much more convenient. Instead of making trips to the bank, the mall, the post

    office, and the grocery store, a person can manage their finances, pay their bills,

shop for gifts, and even buy groceries from the comfort of their own home. As fortunate
as consumers may feel, the internet has actually proven to be an even greater

    boon to the companies providing these conveniences. Now companies can easily

    gather information about their customers that they may not have known before,

    which, in the end, will help them make even more sales to even more customers.

    For example, the internet company Netflix invests a great deal of time and money

collecting data about their customers' movie preferences. They use this information

    to make recommendations to other customers. Better recommendations may result

    in more rentals and, more importantly, higher customer loyalty. However, moving

    from data collection to movie recommendation is not a trivial task, as the datasets

    inevitably grow quite large. One useful way to glean information from these massive

    datasets is to cluster them.

    Clustering is a data mining technique which reorganizes a dataset and places

    objects from the dataset into groups of similar items.1 When the dataset is rep-

1Note to the Reader: Clustering should not be confused with classification; classification names or qualifies the groups and clustering does not, though it might be used as a means to that end (i.e. datasets may be easier to classify once they have been clustered).


    resented as a matrix, clustering is essentially reordering the matrix so that similar

    rows and columns are near each other. Datasets can be created in hundreds of dif-

ferent structures and sizes, so it makes sense that the methods used to cluster

    them are just as abundant and varied. Hundreds of different clustering techniques

    have been developed by both mathematicians and computer scientists over the years;

these techniques can be grouped into two main categories: hierarchical and partitional

    [28].

Hierarchical: In hierarchical algorithms, clusters are created in a tree-like

    process by which the dataset is broken down into nested sets of clusters based on some

    measure of similarity between objects. An example diagram describing this process

    is shown in Figure 1.1.

Figure 1.1: Tree diagram for a hierarchical clustering algorithm. Any vertical cut would result in a clustering.

Hierarchical algorithms can be subdivided into two groups: the more widely used agglomerative (or bottom-up) methods, and the divisive (or top-down) methods.

    Linkage methods such as the Nearest Neighbors algorithm, which forms clusters by

    grouping objects that are nearest to each other, and the Centroid Method, which



    chooses central objects and then clusters the other objects according to their proximity

    to either centroid, are good examples of popular agglomerative clustering algorithms

    [8]. Divisive algorithms work in the opposite direction, starting with the full dataset

    as one cluster and then splitting it into smaller and smaller pieces. However, these

    techniques tend to be more computationally demanding, and, as mentioned before,

    are not as popular as the agglomerative methods. More examples of hierarchical

techniques, along with some discussion of their merits and disadvantages,

    are summarized by Everitt et al. in [12].

    Partitional: Partitional algorithms work by dividing the dataset into disjoint

    subsets. Principal Direction Divisive Partitioning, or PDDP, which divides a dataset

    into halves using the principal direction of variation of the dataset as described by

Boley in [9], falls into this category along with the other Singular Value Decomposition

    (SVD) based algorithms presented in Chapter 3. Spectral methods, or clustering

    algorithms that analyze components of the eigen decomposition, are also partitional

algorithms. One partitional technique that has gotten a lot of attention lately uses

    the PageRank vector to cluster data [2]. Independent Component Analysis, or ICA,

    analyzes and divides a dataset so that objects between clusters are independent, and

    objects within clusters are dependent [3], [17]. The k-means algorithm is a very

    popular partitional algorithm that has long been upheld as a standard in the field

of clustering due to its efficiency, flexibility, and robustness. The algorithm divides the

    data into k groups centered around the k cluster centers that must be chosen at the

    outset of the algorithm [19]. A nice comparison of many different algorithms, both

hierarchical and partitional, is presented by Halkidi et al. in [16].

I go into more detail about the history of spectral clustering in Chapter 2, since

    this particular field gave birth to the SVD clustering methods introduced in Chapter

    3; the first, SVD Signs, is an algorithm outlined by Dr. Carl Meyer in [22], and

the second, SVD Gaps, is my own SVD-based clustering algorithm. Of course, after


    introducing these two SVD clustering methods, some measure of cluster goodness

    is necessary in order to compare the two methods directly. Therefore, in Chapter 4

an entropy measure will be introduced that can be used to measure how well a

    dataset has been clustered. Many times, though, it is useful to be able to find an

    average clustering when one algorithm does not stand out above the rest, so Chapter

    5 introduces some ideas about cluster aggregation, a way to combine the results from

    several clustering algorithms, that will be helpful in cases where multiple algorithms

    produce good clusterings. Throughout these chapters small datasets will be used as

    examples to help clarify how the algorithms work, and what a well-clustered matrix

    looks like. However, as mentioned above, datasets tend to be very large in real life.

    Thus, it is crucial that clustering methods work as well on these large datasets as

    they do on the smaller ones. In Chapter 6 the results of my experiments with the

    SVD Signs and SVD Gaps algorithms on three large datasets (one from Yahoo!, one

    from Wikipedia, and one from Netflix) will be presented.

    The images in this paper have been created using several different computer

    programs. I used MATLAB for some of the simpler images such as line graphs, and

the Apple Grapher application to display small three-dimensional datasets. All of the
images of large matrices were created with David Gleich's VISMATRIX tool [11].


    Chapter 2

    The Fiedler Method

Though graph theoretic clustering has been used heavily by computer scientists in

    recent decades, the understanding of these methods is rather new. The mathematics

    behind these methods were not explored until the late 1960s and early 1970s. In 1968,

Anderson and Morley published their paper [18] on the eigenvalues of the Laplacian

matrix, a special matrix in graph theory that will be defined in Section 2.1. Then

    in 1973 and 1975, Miroslav Fiedler published his landmark papers [14] and [13] on

the properties of the eigensystems of the Laplacian matrix. Fiedler's ideas were not

    applied to the field of clustering until Pothen, Simon, and Liou did so with their

1990 paper [24]. These papers are the origins of spectral graph partitioning, a sub-

    field of clustering that uses the spectral or eigen properties of a matrix to identify

    clusters. There are two methods in the spectral category that inspired the SVD-based

    clustering methods introduced in Chapter 3: the Fiedler Method and the Extended

    Fiedler method.

    2.1 Background

    The Fiedler Method takes its name from Miroslav Fiedler because of two important

    papers he published in 1973 and 1975 that explored the properties of eigensystems


    of the Laplacian Matrix. What is the Laplacian matrix? Consider this small graph

    with 10 vertices or nodes that are connected by several edges.

    Figure 2.1: Small graph with 10 nodes and its adjacency matrix.

    We can easily represent this graph with a binary adjacency matrix, where the

    rows and columns represent the 10 nodes, and non-zero entries represent the edges

    between nodes. Any graph, small or large, can be fully represented by a matrix. Once

    we have an adjacency matrix, the corresponding Laplacian matrix L can be found by

L = D - A,    (2.1)

    where A is the adjacency matrix, and D is a diagonal matrix containing the row sums

    of A. Figure 2.2 shows the Laplacian matrix for the 10 node graph given above.
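
As a concrete illustration, this construction takes only a few lines of MATLAB; the 3-node adjacency matrix below is a hypothetical stand-in, not the 10 node graph of Figure 2.1.

    % Build the Laplacian L = D - A of Equation (2.1).
    % A is assumed to be a symmetric binary adjacency matrix (hypothetical example).
    A = [0 1 1;
         1 0 1;
         1 1 0];
    D = diag(sum(A, 2));   % diagonal matrix of row sums (vertex degrees)
    L = D - A;             % the Laplacian matrix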

    Figure 2.2: Finding the Laplacian matrix for the adjacency matrix in Figure 2.1

    These matrices can be used to discover important properties of the graphs they


    represent. Most importantly, a Laplacian matrix can give us information about the

    connectivity of the graph it represents. In fact, in his papers [14] and [13], Miroslav

    Fiedler proved that the eigenvector corresponding to the second smallest eigenvalue,

    which is now called the Fiedler vector, can tell us how a graph can be broken down into

    maximally intraconnected components and minimally interconnected components. In

    other words, the Fiedler vector is a very useful tool for partitioning the graph. A more

in-depth look at spectral graph theory can be found in Chung's Spectral Graph Theory

    [10]. Some more important results about spectral partitioning are shown in [26] and

    [20].

    2.1.1 Clustering with the Fiedler vector

    Suppose we have the graph from Figure 2.1 with its corresponding Laplacian matrix

    (see Figure 2.3).

    Figure 2.3: Graph with 10 nodes and its Laplacian matrix.

From the Laplacian matrix we obtain the eigen decomposition. Figure 2.4 shows the eigenvectors and eigenvalues from the Laplacian matrix in Figure 2.3. Notice that the smallest eigenvalue is 0, and its corresponding eigenvector is a scalar multiple of the vector of all ones, as is the case with all Laplacian matrices. The eigen-

    vector we are interested in is the Fiedler vector, which is circled. We can use the


    signs of this eigenvector to cluster our graph. This clustering method is known as the

    Fiedler Method.

Figure 2.4: Eigenvectors and eigenvalues of L, with the second smallest eigenvalue and the Fiedler vector circled.

    The rows with the same sign are placed in the same cluster. Therefore, for

    the 10 node example, nodes 1, 2, 3, 7, 8, and 9 are in one cluster while nodes 4, 5, 6,

    and 10 are in another cluster. Looking at Figure 2.5, we can see this partition makes

    a lot of sense in the context of the graph. As expected, the Fiedler Method cut the

    graph into two better connected subgraphs.
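
A minimal MATLAB sketch of one such sign-based split is given below, assuming L is the Laplacian from Section 2.1; the variable names are mine, not code from the thesis appendix.

    % One iteration of the Fiedler Method: split the nodes by the sign of the
    % Fiedler vector (eigenvector of the second smallest Laplacian eigenvalue).
    [V, E] = eig(full(L));             % eigen decomposition of the Laplacian
    [~, order] = sort(diag(E));        % eigenvalues in ascending order
    fiedler = V(:, order(2));          % the Fiedler vector
    cluster1 = find(fiedler >= 0);     % nodes with nonnegative entries
    cluster2 = find(fiedler < 0);      % nodes with negative entries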

Figure 2.5: Signs of v2, the Fiedler vector, and the partition made by the first iteration of the Fiedler Method.

    The next step is to take each subgraph and partition each with its own Fiedler

vector. We will only do this with the larger subgraph, since the cluster containing

    nodes 4, 5, 6, and 10 is fully connected and so it does not make sense to cluster it


    further. However, as we can see in Figure 2.6, the second iteration of the Fiedler

    Method works very nicely on the second half of the graph.

    Figure 2.6: Partition made by the second iteration of the Fiedler Method.

    For this small graph, two iterations are sufficient for a satisfactory clustering.

As graphs become larger, however, many more iterations are necessary. The

    algorithm stops when no more partitions can be made such that the number of edges

    between two clusters is less than the minimum number of edges within either of

    the two clusters. Barbara Ball presents a much more thorough explanation of this

    algorithm in [4].

    The Fiedler Method has been shown to perform very well in experimentation.

    Some experimental results are given in [31], [4], and [15].

    2.1.2 Limitations

    Though this method is theoretically sound, and has been shown to work very nicely on

    large as well as small square symmetric matrices, it does have drawbacks. First, the

    Fiedler Method is iterative. Therefore, if at any point a questionable partition is made,

    the mistake is exacerbated by further iterations. Also, new eigen decompositions must

    be found at every iteration, which can be expensive for larger datasets.

    Secondly, the Fiedler Method only works for square symmetric matrices. Many


    different symmetrization techniques have been developed for non-square or non-

    symmetric matrices, but inevitably some information contained in the matrix is lost

    whenever symmetry is forced.

    Though it is still based on the eigen decomposition, the next clustering algo-

    rithm does not carry the drawbacks of an iterative procedure.

    2.2 Extended Fiedler Method

    In the last section we considered the application of the Fiedler vector to the problem of

    clustering. Surely the Fiedler vector is not the only eigenvector that can be of service.

    In fact, Extended Fiedler finds much success by incorporating multiple eigenvectors.

    2.2.1 Clustering with Multiple Eigenvectors

    Since the rise of the Fiedler Method, many mathematicians have developed a similar

    clustering algorithm, referred to as the Extended Fiedler method, that uses multiple

    eigenvectors (see references [1] and [5] for such algorithms).

    The Extended Fiedler method follows the same preliminary steps as the Fiedler

    Method, but diverges when it comes to the actual clustering. Instead of looking at

    the signs of one eigenvector, Extended Fiedler looks at the sign patterns of multiple

    eigenvectors.

    Algorithm 1 Extended Fiedler Let L be the Laplacian matrix for a symmetric

    matrix A.

    1. Find Vk, a matrix containing the first k eigenvectors of L and Ek, a diagonal

    matrix containing the first k eigenvalues of L in ascending order such that Vi is

    an eigenvector with the ith eigenvalue in Ek.

2. Look at the signs of columns 2 through k of Vk.


3. If rows i and j have the same sign pattern, then rows i and j of A belong in the

    same cluster.

    The algorithm extends to as many eigenvectors as the user deems necessary.

If k vectors are used, then up to 2^k (but often fewer) clusters result.
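
The sign-pattern grouping of Algorithm 1 can be sketched in MATLAB as follows; this is only an illustration under the assumption that L is the Laplacian of a symmetric matrix A, not the thesis appendix code.

    % Extended Fiedler: group rows by the sign patterns of eigenvectors 2..k of L.
    k = 3;                                      % number of eigenvectors to use
    [V, E] = eig(full(L));
    [~, order] = sort(diag(E));                 % eigenvalues in ascending order
    signPatterns = sign(V(:, order(2:k)));      % signs of eigenvectors 2 through k
    [~, ~, clusters] = unique(signPatterns, 'rows');   % equal patterns share a cluster
    % clusters(i) is the cluster number assigned to node i of the graph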

    To help explain Extended Fiedler, I will demonstrate how it clusters the 10

    node graph from Section 2.1 when k = 2 eigenvectors are used. Note first that the

algorithm finds only 3 clusters (see Figure 2.7), which is fewer than the potential of
4 clusters. Second, notice that for this example the Extended Fiedler method and

    the Fiedler Method produced the exact same clustering of this dataset. Also, I only

    needed to calculate one eigen decomposition. Some experimental results with this

    algorithm are detailed by Basabe in [5].

Figure 2.7: Results of Extended Fiedler on the 10 node graph when using the signs of 2 eigenvectors.

    2.2.2 Limitations

    Though Extended Fiedler frees us from the iterative processes of the Fiedler Method,

    we are still bound to and limited by the eigen decomposition which only exists for

    square matrices, and only has real-valued eigenvalues and eigenvectors when the ma-

    trix is symmetric. The next Chapter moves on to the related but more flexible singular


    value decomposition, and the two SVD-based clustering methods, which do not have

as many limitations as the two Fiedler algorithms.


    Chapter 3

    Moving from Eigenvectors to

    Singular Vectors

    It would be nice to find a clustering method as simple and robust as the extended

    Fiedler method but that is more flexible and extends to rectangular matrices. Ob-

    viously, the main obstacle in our way is the decomposition being used. Is there a

    decomposition for both rectangular and square matrices that has the same structure

    as the eigen decomposition?

    As it turns out, the Singular Value Decomposition accomplishes this goal. A

unique SVD is defined for a matrix of any size, and the SVD of a square matrix is related

    to its eigen decomposition. SVD is not as widely known or studied as the eigen

    decomposition, so we will define it now.

Definition 1 [21] The singular value decomposition of an m × n matrix A with rank r is an orthogonal decomposition of the matrix into three matrices such that

A_{m×n} = U_{m×r} S_{r×r} V^T_{r×n}.    (3.1)


    Figure 3.1: A Diagram for the Singular Value Decomposition of a matrix

    This decomposition is called orthogonal since the columns of U are orthogonal

    to each other. The same holds for the rows of VT. The matrix S is a diagonal matrix

    that contains the singular values of A in descending order. Singular values are always

    non-negative real numbers [21]. In Section 3.2, the three components of the SVD, U,

    S, and VT, will be addressed in greater detail.

    As mentioned above, the eigen decomposition and the singular value decom-

position are closely connected. Suppose B and C are the square symmetric matrices B = AA^T and C = A^T A for some rectangular matrix A with singular values s_i, left singular vectors u_i, and right singular vectors v_i. Then

B u_i = s_i^2 u_i,    (3.2)

and

C v_i = s_i^2 v_i.    (3.3)

Therefore, u_i is an eigenvector of B with eigenvalue s_i^2, and v_i is an eigenvector of C with eigenvalue s_i^2 [28]. Hence, if A is a square symmetric matrix, then the eigen decomposition of A and the singular value decomposition of A are equivalent.
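
The relationship in (3.2) is easy to verify numerically; the short MATLAB check below uses a random rectangular matrix and is only a sanity check of my own, not part of the thesis.

    % Check that a left singular vector of A is an eigenvector of B = A*A'
    % with eigenvalue equal to the squared singular value.
    A = rand(6, 4);                             % a hypothetical rectangular matrix
    [U, S, V] = svd(A, 'econ');
    B = A * A';
    err = norm(B*U(:,1) - S(1,1)^2 * U(:,1));   % should be near machine precision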

    The Singular Value Decomposition is an exact decomposition. In other words,

    we can multiply U, S, and VT and get back the original matrix A. However, we can


    also use SVD to find an approximation of A of rank k, where k < r, by multiplying

    only the first k columns of U, the first k values in S, and the first k rows of VT, as

    in the diagram below. This is called the truncated SVD of a matrix.
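
In MATLAB the truncated SVD can be computed directly with svds; the sketch below shows the rank-k approximation described above, assuming A is the matrix of interest and k = 4 is an arbitrary illustrative choice.

    % Truncated SVD: keep the k dominant singular triplets and form the best
    % rank-k approximation Ak = Uk * Sk * Vk'.
    k = 4;                             % illustrative choice of rank
    [Uk, Sk, Vk] = svds(A, k);         % k largest singular values and vectors
    Ak = Uk * Sk * Vk';                % rank-k approximation of A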

    Figure 3.2: A Diagram for the truncated Singular Value Decomposition of a matrix

    This truncated SVD not only gives us a rank k approximation to a matrix A,

    it gives us the best possible rank k approximation in the following sense:

Theorem 1 (Eckart and Young; see [21]) Let A be a matrix of rank r, let A_k be the SVD rank-k approximation to A with k ≤ r, and let B be any other matrix of rank k. Then

||A - A_k||_F ≤ ||A - B||_F.    (3.4)

When k < r, a natural question is how to choose k. One common approach is to plot the singular values and look at where the magnitude of these values drops, forming a sort of elbow on the line graph. If the elbow occurs at the jth singular value, we might set k = j.
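
This elbow rule can be roughed out in MATLAB as below; choosing the largest drop between consecutive singular values is my own crude automation of the visual heuristic, not a rule from the thesis.

    % Crude elbow heuristic: choose k where the largest drop between
    % consecutive singular values occurs.
    s = svds(A, 30);                   % leading singular values (assumes A has at least 30)
    [~, k] = max(-diff(s));            % largest drop follows the k-th singular value
    plot(s, '-o'); xlabel('index'); ylabel('singular value magnitude');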

    Figure 3.3: A line plot of the singular values of a matrix where the y-axis represents

the magnitude of a singular value. Since the graph drops sharply at the fourth singular value, it is reasonable to set k = 4.

    Because of these properties, the singular value decomposition has many appli-

    cations outside the field of clustering as well. Mike Berry and others established the

    usefulness of SVD with respect to information retrieval in [6] and [7], and Skillicorn

    has applied it to counterterrorism in [27].

    3.1 Small Example Datasets

    The next two sections cover the algorithms for my two SVD-based clustering methods.

    In these sections, it will be useful to have a small example matrix to demonstrate how

    well each method clusters and reorders the given matrix. For this purpose, a small

    subset of 45 rows, or search phrases, and 24 columns, or advertisers, was chosen from

    the large Yahoo! dataset (the full Yahoo! matrix is introduced and used in Chapter

    6). The rows and columns were selected so that the small matrix has the nice block-

diagonal structure associated with well-clustered matrices; the matrix was then randomized

    by means of a random permutation. This way we have an answer key with which to

    compare our results.


Figure 3.4: Small Yahoo! example matrix, along with the terms represented by each row, before (left) and after (right) randomization.

Notice there are row labels for the dataset, but no column labels. Yahoo! was willing to release the terms represented by the dataset, but kept the advertisers' names secret for business and privacy reasons. This does not present a great obstacle, since we can still get a very good idea of how a clustering algorithm performs based on

    how the rows have been reordered.

    I will also use two other small matrices that each represent a different set of

    points shown in Figures 3.5 and 3.6 in three-dimensional space. These will be helpful

    visuals when discussing the geometric aspects of each clustering algorithm.

    3.2 How to Cluster a matrix with SVD Signs

The first SVD-based clustering method to be discussed is the SVD Signs method, which uses the sign patterns of the singular vectors rather than the eigenvectors used by the Extended Fiedler method.


    Figure 3.5: 11 by 3 matrix as a set of eleven points in three-dimensional space.

    Figure 3.6: Another set of points used in this chapter.

As discussed earlier, eigenvectors hold a lot of information about a graph's

    connectivity, and this information is exploited by the Fiedler clustering methods.

    Since the eigen decomposition and singular value decomposition are so closely related,

    it is not surprising that the singular vectors also carry a wealth of information about

    the matrices they represent, and so play a central role in both SVD Signs and SVD

    Gaps methods of clustering.

    For both methods, the first step is to find the truncated SVD of the matrix.

This will give us three matrices U_{m×k}, S_{k×k}, and V^T_{k×n} for some chosen k, where the columns of U are the k dominant left singular vectors of A (the k vectors that contribute most to the dataset), the entries of S are the k dominant singular values of A, and the rows of V^T contain the k dominant right singular vectors.

    Once the truncated SVD is obtained, the SVD signs algorithm uses the sign

    patterns of the singular vectors to group the rows and columns in precisely the same

    way as the sign patterns of eigenvectors were used in the extended Fiedler method.

    Rows that have the same sign pattern in the first k singular vectors are grouped

    together.

    For example, Figure 3.7, below, shows the first two left singular vectors from

    some matrix A. Using the sign patterns from these two vectors, rows 1 and 4 would

    be clustered together, rows 2, 5, 6, and 7 would be clustered together, and row 3

would be placed in a cluster by itself. Note that if we use k singular vectors, we can
have up to 2^k clusters, since each row of Uk has k entries, each with 2 possible values.

    Luckily, the algorithm rarely yields such a high number of clusters.

    Figure 3.7: Clustering by using sign patterns of the first two singular vectors

    Why were the left singular vectors used here instead of the right ones? Recall

that earlier this was not an issue because we used the eigen decomposition, which
has only one set of eigenvectors. On the other hand, the SVD gives two sets of

    singular vectors. Which set of singular vectors, the left or the right, should be used?

    Note also that the spectral methods only dealt with square symmetric matrices, and

    so one reordering could be applied to both rows and columns. This is not the case for

    SVD signs, which can be used on rectangular matrices, as well as asymmetric square


    matrices, and so calls for two independent re-orderings. This is where the two sets of

singular vectors come in handy: the signs of the left singular vectors, or the

    columns of U, give a clustering for the rows, while the signs of the right

    singular vectors, or the columns of V, can be used to cluster the columns.

    Algorithm 2 SVD Signs

    1. Find [Uk, Sk, VTk ] = svds(A, k)

2. If rows i and j of Uk have the same sign pattern, then rows i and j of A are in the same cluster.

3. If columns i and j of VTk have the same sign pattern, then columns i and j of A are in the same cluster.
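
A compact MATLAB sketch of Algorithm 2 is given below; the code in Appendix A.1 may differ in details (for example, how zero entries are treated), so this is only an illustration assuming A is the matrix to be clustered.

    % SVD Signs: cluster rows and columns of A by the sign patterns of the
    % k dominant left and right singular vectors.
    k = 3;
    [Uk, Sk, Vk] = svds(A, k);
    [~, ~, rowClusters] = unique(sign(Uk), 'rows');   % rows with equal sign patterns
    [~, ~, colClusters] = unique(sign(Vk), 'rows');   % columns with equal sign patterns
    [~, rowOrder] = sort(rowClusters);                % group rows cluster by cluster
    [~, colOrder] = sort(colClusters);
    Areordered = A(rowOrder, colOrder);               % reordered (clustered) matrix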

    3.2.1 Results on Small Yahoo! Dataset

    The SVD signs method performs very well on the small Yahoo! example, as can

    be seen in Figure 3.8. First, the reordered matrix has the very nice block diagonal

    structure that is characteristic of well-clustered matrices. Second, we can see that,

    with one exception, all the terms were returned to their original categories. Note that

    we chose k = 3 here. It turns out that using three left singular vectors gives the best

clustering of the rows, even though the singular values (shown in Figure 3.8) suggest

    a k value of 4. This is a good example of how problematic choosing a k value can be.

    The next obvious question one might ask is: why does this work? Since SVD is

    an orthogonal decomposition, it has a very nice geometry. In fact, it is the geometrical

    properties of the SVD, which are explained and proved by Meyer in [22], that give a

    clear explanation for why this SVD Signs method works so well.


    Figure 3.8: The singular values of the Small Yahoo! dataset and the results of using

the SVD Signs method with k = 3 on the dataset.

    3.2.2 Why the SVD Signs method works

Note that any m × n matrix can be thought of as a set of m points in an n-dimensional
space. For example, Figure 3.9 shows an 11 × 3 matrix and the corresponding cloud

    of 11 points.

    Figure 3.9: 11 by 3 matrix as a set of eleven points in three-dimensional space.

    In this geometrical context, the right singular vectors represent the vectors

of principal trend of the data cloud. In other words, the first right singular vector

    (referred to from now on as v1) will point in the direction of highest variation in the

    data cloud. However, when we plot the data cloud and its first right singular vector


    together, as in Figure 3.10, it certainly does not look like the vector is pointing in the

    direction with the most variation.

    Figure 3.10: First right singular vector

    This is because the dataset has not been centered, and v1 looks for the direction

    of highest variation from the origin. After centering the dataset, as in Figure 3.11, v1

does point in the direction of principal trend.

    Figure 3.11: First right singular vector of centered dataset

    The second right singular vector, v2, points in the direction of secondary trend

    orthogonal to v1, and v3 points in the direction of tertiary trend orthogonal to both

    v1 and v2.

    Not only do these three right singular vectors represent the three directions of

    principal trend in the dataset, but they also represent a new set of axes for our set of


    Figure 3.12: The three right singular vectors of the dataset

    points!

    Figure 3.13: The right singular vectors can be thought of as a new set of axes for thedataset.

    In this particular example, the dimension of the original dataset and the di-

    mension of the new space created by the right singular vectors are the same because

    all of the right singular vectors were used. What happens when the truncated SVD,

    rather than the full SVD is used? In other words, what if instead of using all n right

    singular vectors, we decide to use only k of them? In this case, the original dataset

    of dimension n is projected into a space of lower dimension k. Why would anyone

    want to do this? Wouldnt a lot of the information in the original dataset be lost?

    Of course, as with any projection of this nature, some of the information will be lost.

    But the information lost will be the least important, and sometimes even superfluous,


    information that could be obfuscating important correlations in the original matrix.

    This is because the Singular Value Decomposition naturally sorts trends in the matrix

    from most important to least important [28]. Therefore, in most cases, losing extra

    dimensions does not pose a problem, and can even be helpful.

    Now let us consider the left singular vectors. What is their role? It was shown

    by Meyer [22] that the left singular vectors give the coordinates of the points on the

    new set of axes created by the right singular vectors! In other words, u1 contains the

    orthogonal projections of each point onto v1.

Figure 3.14: The left singular vectors contain the coordinates of the points when projected onto each right singular vector.

    This information about the geometry of SVD can now be used to better un-

    derstand the SVD Signs clustering method discussed above. When the signs of u1

    are used to divide the cloud of points into two pieces, all the points projected onto

the positive half of v1 are placed in one cluster, and all the points projected onto the negative half of v1 are placed in another cluster. This is essentially the same as slicing

    through the centered set of points at the origin with a hyperplane that is orthogonal

    to v1. Figure 3.15 demonstrates this with the set of 11 points, and shows the two

    clusters that result.


Figure 3.15: Points on the positive side of the hyper-plane are in one cluster, while points on the negative side are in another (left). The dataset is now divided into two clusters (right).

These steps are repeated with the signs of u2, resulting in another hyper-plane

    orthogonal to the first that divides the set of points with respect to v2. As shown

    in Figure 3.16, these two planes divide the space into quadrants, with each quadrant

    containing a different sign pattern and therefore a different cluster.

    Figure 3.16: Points are further clustered according to their quadrant.

    Next, the set of points is divided with respect to the signs of u3, resulting in a

    third hyper-plane that divides the space into octants. Note that not all of the octants

happen to contain points, and so fewer than eight, or 2^3, clusters result.

    It is possible to go too far with the method and divide the set of points into

    too many groups. Notice that, while the first two singular vectors resulted in very


    Figure 3.17: Points further clustered according to their octant.

    intuitive clusterings, some of the clusters resulting from the third vector, particularly

the clusters circled in Figure 3.18, are questionable. This represents the consequence

    of moving from k = 2 to k = 3 when clustering this set of points. Is it better to

    have a clustering that is too fine, or not fine enough? There is really no good answer

    to this question. The most appropriate k depends on the goals of and applications

    envisioned by the researcher.

    Figure 3.18: Cost of choosing a k that is too high.

    3.2.3 Limitations of SVD signs

    Like the Fiedler and Extended Fiedler methods before it, SVD signs clusters strictly

    according to positive and negative signs. It breaks the dataset into halves according


    to whether the projection of each point lies on the positive or negative half of a vector.

But what if the data set doesn't naturally break in half? Or if the break point lies
somewhere other than the middle of the dataset? For example, what if SVD Signs

    were applied to the following trimodal data set introduced earlier:

    Figure 3.19: Example of a trimodal dataset

Since the SVD Signs method can only bisect a dataset at any given iteration,

    it would divide this set of points into two halves, cutting right through the middle

    clump of points as shown in Figure 3.20.

    Figure 3.20: Signs method breaks dataset in half, rather than into thirds as desired.

    Clearly, dividing this set of points into three clusters by splitting it at two

    places along the direction of principal trend would be preferable, but SVD Signs


    simply does not have that capability. It was with this flaw in mind that we set out

    to create an SVD clustering method that can tailor itself to the shape of a dataset.

    3.3 SVD Gaps Method

    What if, instead of using the signs of the left singular vectors to blindly cut through

    a dataset at its center, we considered the gaps in the left singular vectors instead?

    Remember that these vectors contain orthogonal projections of the points in the

    directions of principal, secondary, and tertiary trend. Therefore, we can use them to

    find the gaps between points in any of these directions and divide the dataset where

    the gaps occur.

    Figure 3.21: Gaps preserved when a dataset is projected onto a singular vector.

    For example, when we look at the first left singular vector of the tri-modal

    dataset introduced earlier, the large gaps can be found quite easily. As seen in Figure

    3.22, this new gaps method would cut through the center of these gaps and therefore

    create the three clusters we desired earlier.

    The algorithm for SVD Gaps follows many of the same steps as the Signs

    method, but introduces a few important changes. As with the Signs method, the


Figure 3.22: First singular vector of the trimodal dataset with the gaps between entries, and the resulting cuts.

    truncated SVD, using the appropriate rank k, must be found, and then the gaps

    between entries of the left singular vectors are calculated. If a gap between two entries

    is large enough, a division is placed between the corresponding rows of the original

matrix. Again, clustering the columns of a matrix is similar; the only difference is
that the algorithm uses gaps in the right singular vectors to determine where

    to divide the columns.

    Algorithm 3 SVD Gaps

    1. Find [Uk, Sk, Vk] = svds(A, k)

2. For 1 ≤ i ≤ k, sort Ui (or Vi if clustering columns) and find the gaps between

    entries.

    3. If the gap between rows j and j + 1 of Ui (Vi) is large enough then divide A

    between the corresponding rows (columns).

    4. Create a column vector Ci that contains numerical cluster labels for Ui (Vi) for

    all rows (columns).

5. After finding Ci for all 1 ≤ i ≤ k, compare the cluster label patterns for the rows of C.


    6. If rows (columns) i and j have the same cluster label pattern in C, then rows

    (columns) i and j belong in the same cluster.
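
The row-clustering half of Algorithm 3 can be sketched in MATLAB as follows; the gap threshold anticipates the mean-plus-standard-deviation rule of Section 3.3.1, and the thesis code in Appendix A.2 may differ in details.

    % SVD Gaps (rows): cut each sorted left singular vector at its large gaps,
    % then cluster rows by their pattern of segment labels.
    k = 3;  tol = 2.15;                              % rank and standard-deviation cut-off
    [Uk, Sk, Vk] = svds(A, k);
    m = size(A, 1);
    C = zeros(m, k);                                 % segment labels, one column per vector
    for i = 1:k
        [vals, order] = sort(Uk(:, i));              % sort the entries of the i-th vector
        gaps = diff(vals);                           % gaps between consecutive entries
        isCut = gaps > mean(gaps) + tol*std(gaps);   % gaps that are "large enough"
        segments = cumsum([1; isCut]);               % segment number of each sorted entry
        C(order, i) = segments;                      % map back to the original row order
    end
    [~, ~, rowClusters] = unique(C, 'rows');         % same label pattern, same cluster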

    3.3.1 When are Gaps Large Enough?

    Earlier, the issue of when gaps are large enough was glossed over. However, this is

    a very important component of the Gaps algorithm. In fact, the effectiveness of the

    entire method rests on deciding when a gap in the data is large enough to be used as

    a cut. If the criteria are too relaxed, there will be far too many clusters. However, if

    the criteria are too stringent the algorithm might not discover as many clusters as it

should.

So how big is big enough? Obviously, the measure should be relative to the

data. For example, some datasets might have smaller gaps overall, and so have

    smaller significant gaps, than other datasets. Therefore, the significance of a gap

    should be tied to the average size of the gaps in the data. However, the points in

    a dataset can be greatly spread out in the direction of v1, but more compact in the

    direction of v2. In this case, it would not be a good idea to pool all the gaps from

    all the singular vectors together, as this would result in too many cuts in the earlier

    vectors and too few cuts in the later vectors. It makes sense, then, for an average to

    be taken with respect to each individual singular vector, which is exactly what the

    SVD Gaps algorithm does.

    Now that we have an average gap size for each singular vector, we know we

    should only choose gaps that are larger than the average gap, but how much larger

    should it be? It would make a lot of sense if, before making this decision, the algorithm

    took into account how spread out the gap sizes were. Therefore, a standard deviation

    for the gaps of each singular vector should also be found. From there, it is easy to

    calculate how many standard deviations away each gap is from the average gap (this

    might bring to mind the z-score of the normal distribution, but we must remember


    that these gaps are most likely not normally distributed). Any gap that is more than

    a certain number of standard deviations larger than the average gap will be chosen as

a place to cut the dataset. We can use Chebyshev's result about general distributions

    to help us decide the cut-off or tolerance level for the number of standard deviations;

    experimentation with various datasets has shown that using gaps that are 1.5 to

    2.5 standard deviations larger than the average yields nice clusters. The number of

    standard deviations is a parameter of the SVD Gaps method, and can be set by the

    user.

    3.3.2 Results on Small Yahoo! dataset

    SVD Gaps also performs fairly well on the small Yahoo! dataset (see Figure 3.23

    below). I chose a value of 3 for k, again, and a tolerance level (i.e. the standard

    deviation cut-off for the gaps) of 2.15.

Figure 3.23: Small Yahoo! dataset clustered with the SVD Gaps algorithm using k = 3 and tol = 2.15.

    The reordered picture for SVD Gaps does not look quite as nice as the one

    for SVD Signs, but this does not necessarily mean that the clustering is not as good.

    However, by looking at the reordered list of terms, we can see that all of the terms


    are returned to their original group. Are the clusterings equal in strength, since they

    got essentially the same term reordering? It is hard to decide based simply on the

    appearance of the reordered matrix and reordered list of terms. The next chapter

    introduces a more rigorous way to compare the performances of the two algorithms.


    Chapter 4

    Quality of Clusters

    With small examples, such as the tri-modal set of points and the small Yahoo! dataset,

    determining whether a clustering algorithm works well is easy and can be done visually

    (especially when the dataset was created with specific clusters in mind, as was the case

    with both of these). If we want to be able to apply either of these algorithms to real

    world problems, however, a more rigorous measure of the quality of the clustering,

    or cluster goodness, must be found. Such a measure allows one to decide which

algorithm is better: SVD Signs or SVD Gaps. With these goals in mind, let's take a

    look at an entropy measure.

    4.1 Entropy Measure

    The entropy measure used in this paper for clustering is based on the measure pre-

    sented by Meyer in [23] and revolves around the concept of surprise, or the surprise

    felt when an event occurs.

Definition 2 [23] For an event E such that 0 < P(E) = p ≤ 1, the surprise S(p)

    elicited by the occurrence of E is defined by the following four axioms.

    1. S(1) = 0


    2. S(p) is continuous with p

3. p < q implies S(p) > S(q)

    4. S(pq) = S(p) + S(q)

    Basically, events with lower probabilities elicit a higher surprise when they

    occur, and vice versa; the function for surprise in terms of the probability turns out

    to be

S(p) = -log p,    (4.1)

which satisfies S(1) = 0 [23]. Now, if X is a random variable, then the entropy of X is the expected surprise of X.

Definition 3 For a discrete random variable X whose distribution vector is p, where p_i = P(X = x_i), the entropy of X is defined as

E[S(X)] = H_X = -∑_{i=1}^{n} p_i log p_i.    (4.2)

Set t log t = 0 when t = 0.

    Before we can apply this measure to clustering we must resolve a problem

    with the nature of rectangular datasets. One issue that makes these datasets difficult

to work with is that the rows must be clustered and reordered independently of

    the columns. Therefore the reordered matrix does not often have the nice, clean,

    block diagonal structure that we get from symmetric reorderings like the Fiedler or

    Extended Fiedler methods (see Figure 4.1).

    So, even if a matrix has been clustered and reordered well, it can be hard to

    tell by looking at the reordered matrix. However, if a matrix A is clustered well and

reordered to Ā, then R = ĀĀ^T, or the reordered row by reordered row matrix, and


    Figure 4.1: This matrix has very well defined clusters, even though the matrix is not

    in block diagonal form.

C = Ā^TĀ, or the reordered column by reordered column matrix, both have nice block

    diagonal structures with few nonzero entries outside of the blocks. For R these blocks

    represent the row clusters, and for C they represent the column clusters. Figure 4.2

    shows a perfectly block diagonal matrix, as well as one that has a few stray points. If

    these matrices had been reordered by a clustering algorithm, we would say that the

    one on the left had been clustered better than the one on the right.

    Figure 4.2: Two block diagonal matrices 1 and 2

    When we look at R and C for each of these matrices (Figures 4.3 and 4.4),

    it is even more clear that the first matrix has been clustered better than the second


    one.

    Figure 4.3: The reordered row by reordered row matrices for 1 and 2

    Figure 4.4: The reordered column by reordered column matrices for 1 and 2

    Now this idea of entropy can be applied to clustering in the following way

    (which is nicely laid out and fully explained by Meyer in [23]). Say we have a set

    of distinct objects A = {A1, A2,...,An}, each classified with a label Lj from the set

    {L1, L2,...,Ln}, and say we group these objects into k clusters {C1, C2,...,Ck}. Then

    we can create a probability distribution containing probabilities pij such that

p_ij = (number of objects in C_i labeled L_j) / (number of objects in C_i).    (4.3)

    The entropy of an individual cluster Ci is then


H_k(C_i) = -∑_{j=1}^{k} p_ij log_k p_ij.    (4.4)

We set p_ij log_k p_ij = 0 when p_ij = 0. Therefore the entropy of the entire clustering or partition is

H = ∑_{i=1}^{k} w_i H_k(C_i),  where w_i = |C_i| / n.    (4.5)

    This entropy measure H has the following properties:

0 ≤ H ≤ 1

    H = 0 if and only if Hk(Ci) = 0 for all i = 1,...,k

    H = 1 if and only if Hk(Ci) = 1 for all i = 1,...,k

    In other words the entropy scores for a clustering will range from 0 to 1 with 0 being

    the best and 1 being the worst.
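
A minimal MATLAB sketch of this measure is shown below, assuming each object comes with an integer cluster number and an integer label and that there are at least two clusters; the thesis code in Appendix A.3 may be organized differently. Save it as clusterEntropy.m.

    function H = clusterEntropy(clusters, labels)
    % Entropy of a clustering, following Equations (4.4)-(4.5).
    % clusters(i) and labels(i) are positive integers for object i.
    k = max(clusters);
    n = numel(clusters);
    H = 0;
    for i = 1:k
        labs = labels(clusters == i);
        labs = labs(:);                              % column vector for accumarray
        p = accumarray(labs, 1) / numel(labs);       % label proportions within cluster i
        p = p(p > 0);                                % convention: 0 * log 0 = 0
        Hi = -sum(p .* log(p)) / log(k);             % base-k entropy of cluster i
        H = H + (numel(labs) / n) * Hi;              % weight by relative cluster size
    end
    end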

    What if a dataset is not labeled? In fact, none of the datasets presented in

the next chapter are labeled. In these instances there is a clever way to force labels on a dataset that has been clustered. Consider the small clustered 15 by 10 matrix

    shown below in Figure 4.5.

First we divide R according to the row clusters, which in this case will result in 3 row blocks and 3 column blocks. Then for each row of R we create a vector of ratios of the number of nonzeros in each column block over the number of columns in that block. For example, for the first row of R in Figure 4.5, the ratio vector will be {4/4, 1/7, 0/4}. If block j of row i has the highest ratio, then that row will be labeled j. Obviously, row one of R will be labeled 1. The second row is a more interesting case, as its ratio vector is {4/4, 5/7, 4/4} and so we have a tie between blocks 1 and 3. Since a row can only have one label for our entropy measure, we pick the first block with the highest ratio, and so row 2 of R will be labeled 1.
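
A rough MATLAB sketch of this labeling trick follows, assuming R is the reordered row-by-row matrix and rowClusters(i) gives the cluster of row i; the names are mine, not the thesis code.

    % Force a label on each row of R: the first block with the highest ratio of
    % nonzeros to block width wins.
    k = max(rowClusters);
    n = length(rowClusters);
    forcedLabels = zeros(n, 1);
    for i = 1:n
        ratios = zeros(1, k);
        for j = 1:k
            block = (rowClusters == j);                   % columns of R in block j
            ratios(j) = nnz(R(i, block)) / nnz(block);    % nonzero ratio for block j
        end
        [~, forcedLabels(i)] = max(ratios);               % ties go to the first block
    end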


Figure 4.5: A clustered 15 by 10 matrix A and the corresponding row by row matrix R.

    Most of the row labels for R match with the clustering, except for row 6. The

ratio vector for this row is {4/4, 3/7, 0/4}, and so this row will be labeled 1 even though it is in cluster 2. Therefore the entropy measure for R, or the row entropy measure for A,

    will be less than perfect. It turns out to be 0.1742.

    This measure, though not the method of labeling, is presented by Meyer in

    [23], and resembles the method presented in [25]. A similar cluster measure is also

    presented in [19], though here the entropy measure is relative to a perfect partition and

    so is not useful when such a partition is not known. A nice synopsis and comparison

    of several different cluster measuring techniques is presented in [16].

4.2 Comparing results from Small Yahoo! Example

    The row re-ordering of the small Yahoo! dataset for SVD Signs is perfect and has a

row entropy of zero! SVD Gaps does not do quite as well, but didn't do too badly,


with a row entropy of 0.0279. Figure 4.6 shows the reordered row matrices side by

    side.

Figure 4.6: Symmetric reordered small Yahoo! term matrix for the SVD Signs reordering with entropy 0 (left) and for SVD Gaps with entropy 0.0279.

    The column re-orderings were worse for both algorithms. The SVD Signs

    column reordering had an entropy of 0.1332, and it is obvious in Figure 4.7 that the

    column clusters of SVD Signs are not as good as its row clusters. SVD Gaps had a

    clearly worse score of 0.4493 for its column reordering.

Figure 4.7: Symmetric reordered small Yahoo! column matrix for the SVD Signs reordering with entropy 0.1332 (left) and for SVD Gaps with entropy 0.4493.


    Chapter 5

    Cluster Aggregation

    No matter how strong the theoretical or experimental evidence behind a clustering

    algorithm may be, nothing is perfect. All algorithms will have their strengths and

weaknesses, and sometimes one algorithm's weakness may be another algorithm's

    strength. Thus it makes sense to find a way to combine them and bring out the best

    from several algorithms. This is exactly what cluster aggregation does.

    Before I give my algorithm for cluster aggregation I will demonstrate it on

    a small example. Suppose three different clustering algorithms are used on a small

    dataset containing eight objects, resulting in the three clusterings shown in Figure

    5.1.

    Figure 5.1: Small example of three different clusterings


    We can use these clusterings to build a graph that represents the relationships

between the clusterings. For instance, since objects 4 and 8 are clustered together by
two of the algorithms, there is an edge with weight 2 between nodes 4 and 8 in

    the graph shown in Figure 5.2. If two objects have no common clusters, then there

    is no edge between them.

    Figure 5.2: Graph representing three different clusterings of 8 objects

    Of course, any graph can be easily translated to an adjacency matrix A, as

    in Figure 5.3, where the nodes become rows and columns and the edges become the

    entries of the matrix. For instance, the 4th row of the 8th column and the 8th row of

    the 4th column are both 2. In general, if objects i and j have n common clusters,

    then Aij = Aji = n.
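
Building this adjacency matrix is straightforward; the MATLAB sketch below assumes the clusterings are stored column-wise in an n-by-m matrix of integer labels, a layout I chose for illustration.

    % Aggregation adjacency matrix: Aagg(i,j) counts how many of the m clusterings
    % place objects i and j in the same cluster.
    [n, m] = size(clusterings);          % clusterings(i,t) = label of object i by algorithm t
    Aagg = zeros(n);
    for t = 1:m
        for i = 1:n
            for j = i+1:n
                if clusterings(i, t) == clusterings(j, t)
                    Aagg(i, j) = Aagg(i, j) + 1;
                    Aagg(j, i) = Aagg(i, j);
                end
            end
        end
    end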

    Figure 5.3: Creating an adjacency matrix from the graph


    This adjacency matrix represents all three of the clustering algorithms, and

    clustering this matrix yields a nice aggregation of the three methods. Of course, this

raises the question: what clustering method should be used? After all, if we knew the

    best clustering method, there would be no need for cluster aggregation in the first

    place. However, notice that the adjacency matrix is a square symmetric matrix. For

    these types of matrices, spectral methods, specifically those which use the signs of

the eigenvectors as in [5] and [1], have actually been proven to be the best clustering

algorithms. Therefore, I have chosen to use the method laid out in [5], and described

    in Section 2.2 of this thesis, to cluster the adjacency matrix. The results for the small

example are shown in Figure 5.4.

    Figure 5.4: Results of Cluster Aggregation with small example

    The simplicity of this aggregation method gives it a surprising amount of

    flexibility and therefore a wide range of uses. In the small example used here, all of

    the algorithms were given equal weight in the adjacency matrix, and therefore seen as

    equally valid. However, in some instances, one or more of the clusterings might stand

    out above the rest. In these cases, it is easy to adjust the values in the adjacency

matrix to reflect this disparity (a weighted variant is sketched after this paragraph). Aggregation might also be used when a proper k value

cannot be discerned for a given matrix. Instead of settling on one value of k that may

    not be optimal, several values can be chosen, and their corresponding results can be

    aggregated.
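
A rough sketch of that weighted variant is given below; the weight vector w is an assumption (it could, for example, be set inversely to each clustering's entropy score), and L is the same label matrix as before.

% Weighted aggregation sketch: w(kk) is the (assumed) weight given to
% clustering method kk; larger weights make that method count more.
[n,p] = size(L);
w = [2 1 1];                             %e.g. trust the first clustering twice as much
A = zeros(n,n);
for kk = 1:p
    same = bsxfun(@eq,L(:,kk),L(:,kk)'); %co-membership matrix for method kk
    A = A + w(kk)*(same - eye(n));       %add weighted edges, keep zero diagonal
end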


    Many aggregation techniques have been introduced in the field of data mining,

    and some are quite similar to the one presented above. The aggregation algorithm

introduced in [29] uses a hyper-graph that is equivalent to the adjacency matrix used

    here to capture the relationships between different clusterings. Another interesting

    aggregation algorithm is the one presented in [30], which is modeled after the way

    ant-colonies sort larvae and also makes use of the hyper-graphs presented by [29].

    5.1 Results on Small Yahoo! Dataset

Figure 5.5: Adjacency Matrix of Small Yahoo! Dataset using clusters from SVD Signs and SVD Gaps


    Figure 5.6: Clustered adjacency matrix and the sorted term labels


    Chapter 6

    Experiments on Large datasets

    All the time spent researching and creating these algorithms is wasted if the algo-

    rithms themselves are useless when applied to large data sets. After all, one of the

    main reasons that clustering is so important is that it can be useful for breaking down

    and processing large amounts of information. Therefore, it is important at this point

    to demonstrate the results of the two SVD based clustering algorithms when applied

    to a couple of large data sets.

    6.1 Yahoo!

The first data set is a binary 3,000 by 2,000 matrix compiled by Yahoo! The matrix

    represents the relationships between 3,000 search terms and 2,000 (anonymous) ad-

    vertisers. If an advertiser j bought a given search phrase i, then there is a 1 in the

    ijth entry of the matrix; all other entries are 0.

The image below, created with David Gleich's VISMATRIX tool, gives an idea

    of what the raw dataset looks like before reordering (the terms are in alphabetical

    order, and advertisers are in random order). Blue dots represent ones, and zeros are

    represented by whitespace.
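
For reference, a view like this can be produced with a couple of lines of MATLAB, assuming the dataset is stored in the SMAT format read by the readSMAT function in Appendix A.5; the file name below is hypothetical.

A = readSMAT('yahoo_terms_by_advertisers.smat'); %hypothetical file name
spy(A)                                           %each dot marks a 1 in the matrix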

    Before we run each of the algorithms on the full Yahoo! dataset, we must


    Figure 6.1: Vismatrix display of the raw 3000 by 2000 Yahoo! dataset.

    choose a value for k. Since the singular values can often be helpful for making this

    decision, Figure 6.2 shows a line graph of the first 100 singular values of the Yahoo!

    matrix.

    Figure 6.2: Plot of the singular values of the Yahoo! dataset.
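
A plot like Figure 6.2 can be produced with MATLAB's built-in svds (the fastsvds routine in Appendix A.5 would work as well):

s = svds(A,100);        %the 100 largest singular values of A
plot(s,'-o')
xlabel('index i'), ylabel('singular value')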

    Unfortunately, the singular values plotted in Figure 6.2 do not provide a very

    clear answer. We can see that k = 20 would probably be too early a cut-off. I finally

    chose to compare the results for k = 25 and k = 35. Figure 6.3 shows the results for


SVD Signs.

Figure 6.3: Results of the SVD Signs algorithm on the full Yahoo! dataset for k = 25 (left) and k = 35 (right).
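
The two runs compared in Figure 6.3 can be reproduced, roughly, with the SVD Signs code in Appendix A.1; here centeringtoggle = 0 (uncentered data), and trms and docs stand for the term and advertiser label arrays.

svdclusterrect(A,25,0,trms,docs); %SVD Signs with k = 25
svdclusterrect(A,35,0,trms,docs); %SVD Signs with k = 35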

    SVD Signs performed quite nicely in both of these trials. In both, there are

plenty of nice dense rectangles along the diagonal. On exploring these clusters, we

    find that the terms in each cluster have similar themes: one contains search terms

    related to hotels, one contains terms related to online gambling, etc. Which choice

    for k yields better results? For k = 25, the row entropy is 0.0555 and the column

entropy is 0.0482, whereas for k = 35 both the row and column entropies are higher.

It is clear from the pictures and entropy scores that using k = 25 results

    in better clusters; using a higher k results in too many clusters for both rows and

    columns. Next let us look at how the SVD Gaps algorithm performed. Figure 6.4

    shows the results using a tolerance level of 2.3. The SVD Gaps clustering got a row

    entropy of 0.1023 and a column entropy of 0.1421.
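
This run can be sketched with the SVD Gaps code in Appendix A.2; the tolerance 2.3 comes from the text, while using k = 25 and the same tolerance for rows and columns are assumptions made here.

svdgapcluster2(A,25,0,2.3,2.3,trms,docs); %SVD Gaps, assumed k = 25, tol = 2.3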

    How do the results compare? As we see in Figure 6.5, the SVD Signs method

produced a much cleaner block diagonal structure than SVD Gaps. However, SVD Gaps

seems to have found denser clusters than SVD Signs. This seems to indicate that


    Figure 6.4: SVD Gaps results using tol = 2.3

    SVD Gaps is good at finding really strong clusters, but not good at finding weaker

    clusters.

Figure 6.5: Side by side comparison of results from SVD Signs (left) and SVD Gaps (right).


    Figures 6.6 and 6.7 show side by side images of the reordered search phrase by

    reordered search phrase matrices, and the reordered advertiser by reordered advertiser

    matrices for SVD Signs and SVD Gaps so we can compare the row clusters and the

column clusters for the two algorithms.

    Figure 6.6: Term by Term matrices for SVD Signs (left) and SVD Gaps (right)

    Figure 6.7: Column by Column matrices for SVD Signs and SVD Gaps


Again, the results for SVD Signs look stronger at first glance. However, although

SVD Gaps did not perform as well on the dataset as a whole, it still did

a better job on the densest clusters. We might conclude that SVD Gaps

tends to focus only on the strongest clusters in a dataset and ignore the weaker

    ones. In some applications, this would be a very helpful trait.

    6.2 Wikipedia

    The next large dataset, shown in Figure 6.8, is a binary matrix representing links

between 5,176 Wikipedia articles and 4,107 categories. Each article can be placed in

    multiple categories, and each category can contain several articles.

Figure 6.8: Vismatrix display of the raw 5176 by 4107 Wikipedia dataset, and a plot of its first 100 singular values

The results of SVD Signs and SVD Gaps on the Wikipedia dataset are shown

    side by side in Figure 6.9. Neither algorithm produced a strong block diagonal struc-

    ture, but this is due to the nature of the dataset, and does not necessarily mean that


    the dataset was clustered poorly by either algorithm.

Figure 6.9: Side by side comparison of the SVD Signs and SVD Gaps results for the Wikipedia dataset.

Though the SVD Gaps reordering does not look as nice, it actually received better

(lower) entropy scores. The row entropies for SVD Signs and SVD Gaps were 0.1257 and

0.0956 respectively, while the column entropies were 0.3573 and 0.2013 respectively.

    For this dataset, it is much easier to compare the results by looking at the reordered

    article by reordered article (see Figure 6.10) and reordered category by reordered

    category matrices (see Figure 6.11).

    Looking at Figure 6.10, it is more apparent that SVD Gaps produced a better

clustering for the Wikipedia dataset. The structure is much cleaner, and there are even lots of nice clusters within clusters, which are also present for SVD Signs but

    are not as well defined.

    The column reorderings show a similar story. The reordering for SVD Signs

    looks more interesting than that of SVD Gaps, but the clusters found by SVD Signs


Figure 6.10: Reordered row by reordered row matrices for SVD Signs (left) and SVD Gaps (right).

Figure 6.11: Reordered column by reordered column matrices for SVD Signs (left) and SVD Gaps (right).

are rather weak, and we can see there are many groups of objects that lie outside

of, yet are rectilinearly aligned with, the clusters. This suggests that many columns have

been mis-clustered. SVD Gaps, on the other hand, produced only very small and


    dense clusters, and there is not much noise in the rows and columns aligned with

    these clusters.

    6.3 Netflix

    The Netflix dataset used here is a 280 user by 17,770 movie dataset containing the

users' ratings (from 1 through 5, with 5 being the best) of movies rented through Netflix.

    All the users in this dataset rated at least 500 movies. An image of this matrix is

    shown below in Figure 6.12.

Figure 6.12: Vismatrix display of the raw 280 by 17,770 Netflix dataset and a plot of its singular values.

    The results for both methods, shown in Figure 6.13, are interesting. Both

    methods seem to have gathered as much data as possible into a few clusters in the

    outside columns, and neither algorithm found very many column clusters.

    This seems odd, since the number of columns is so large, but makes more

    sense when the odd nature of the dataset is considered. Remember that the dataset

only represents users who have rated at least 500 movies. Thus, every person

    represented in the matrix is a Netflix superuser who must like movies a great deal,


Figure 6.13: Results for SVD Signs with k = 7 and for SVD Gaps with k = 7 and tol = 2

    and many of these 280 people probably share a lot of opinions. If all of the users are

rating similar numbers of similar movies with similar scores, the dataset will be fairly

homogeneous, which will lead to strange and poor clusterings. After considering these

    aspects of the dataset, it makes more sense that both methods found a few very large

    and dense clusters.

    Unfortunately, because the matrix is so large, there was not enough memory in

MATLAB to compute the entropy scores for the SVD Signs and SVD Gaps clusterings

of the Netflix dataset, and so we cannot conclude whether one algorithm performed

    better than another on this dataset.


    Chapter 7

    Conclusion

Thus far, SVD Signs tends to be the more robust clustering algorithm. Because it only

has one parameter, k, it is easier to work with and produces better overall clusterings

on a more regular basis than SVD Gaps. With the SVD Gaps method introduced here,

    we not only have to choose the appropriate value for k, we also have to choose an

    appropriate tolerance level for the dataset.

    However, there are some very positive aspects of the SVD Gaps method. The

    method showed again and again that it excelled in singling out the strongest clusters in

the dataset, while ignoring weaker and less important clusters. In many applications

    of clustering, this might be a highly desirable quality. Regardless, SVD Gaps has

    not yet reached its full potential. It might be helpful to first learn more about

    the statistical distribution of a dataset before applying the SVD Gaps algorithm.

    This would help us to determine a tolerance level, and maybe even whether different

    tolerance levels should be used for different singular vectors.

    The Cluster Aggregation algorithm presented in Chapter 5 has a lot of promise

    as well, and it would have been nice to spend more time working and experimenting

    with it. Some expansions on this algorithm also need to be explored. Weighting

clusterings before aggregating them (perhaps inversely proportional to


    their entropy scores) could improve results. Also, the algorithm could be used to

    aggregate clusterings for different values of k, especially in cases where one value

    produces too few clusters, but a higher value produces too many.

    Some other areas that need further work are the entropy measure, the method

    of choosing k, and cluster ordering. The entropy measure works nicely, but the

labeling scheme needs to be expanded to work on denser matrices, as it now

works only for sparse matrices. Since the success of either SVD Signs or SVD Gaps

    depends so much on the choice of k, a more reliable method must be found and

    employed for both algorithms. Finally, it would be nice if clusters were ordered and

    displayed so that clusters next to each other are most similar; this would be most

    useful when analyzing a clustering for data mining purposes.


    Appendix A

    MATLAB Code

    A.1 SVD Signs

    function []=svdclusterrect(A,k,centeringtoggle,trms,docs);

    %% INPUT: A = m by n matrix

    %% k = number of principal directions to compute

    %% centeringtoggle = 1 if you want to center the data matrix A

%% first and work on the centered matrix C

    %% = 0 if you want to work with uncentered data

    %% matrix A

    %% trms = row labels

    %% docs = column labels

[m,n]=size(A); %finds the dimensions of A

if centeringtoggle==1

    mu = A*ones(n,1)/n;


    A= A-mu*ones(1,n); % A is now the centered A matrix

    end

    [U,S,V]=fastsvds(A,k); %finds truncated SVD of A

    % to find a row reordering that clusters rows

    E=(U>=0);

    %finds positive entries of left singular vectors

    x=zeros(m,1);

    %creates vector of zeros

    for i=1:k;

    x=x+(2^(i-1))*(E(:,k-i+1));

    end

    %designates sign pattern to each row of U

    [sortedrowx,rowindex]=sort(x);

    %sorts x by sign patterns of U

    numrowclusters=length(unique(x))

    %finds the number of row clusters of A

    % to find a column reordering that clusters columns

    F=(V>=0);


    % finds all positive entries of matrix V

    y=zeros(n,1); %creates vector of zeros

    for i=1:k;

    y=y+(2^(i-1))*(F(:,k-i+1));

    end

    %designates signs patterns to each row of V

    [sortedcoly,colindex]=sort(y);

    %sorts y by sign patterns of V

    numcolclusters=length(unique(y))

    %finds number of column clusters of A

    if centeringtoggle==1

    A= A+mu*ones(1,n);

    % A changed back to the original uncentered A matrix

    end

    reorderedA=A(rowindex,colindex);

    %reorders A into clustered form

    spy(reorderedA)

    %spyplot of reordered matrix

    trms=cellstr(trms);


    reorderedterms=trms(rowindex);

    %reorders row labels

    docs=cellstr(docs);

    reordereddocs=docs(colindex);

    %reorders column labels

er=entropy2(reorderedA*reorderedA',x,m,numrowclusters)

    %finds row entropy of reordered A

ec=entropy2(reorderedA'*reorderedA,y,n,numcolclusters)

    %finds column entropy of reordered A

    cd /Users/langvillea/David/vismatrix2

    vismatrix2(reorderedA,reorderedterms,reordereddocs)

    cd /Users/langvillea/Desktop/datavisSAS-Meyer/vismatrix

    %sends reordered A to VISMATRIX tool

    A.2 SVD Gaps

    function []=svdgapcluster2(A,k,centeringtoggle,confidence,

    confidence2,termlabels,doclabels)

    %% INPUT: A = m by n matrix

    %% k = number of principal directions to compute

    %% centeringtoggle = 1 if you want to center the data matrix A

    %% first and work on the centered matrix C


    %% = 0 if you want to work with uncentered data

    %% matrix A

    %% confidence = tolerance level for rows

%% confidence2 = tolerance level for columns

    %% termlabels = labels for rows

    %% doclabels = labels for columns

    [m,n]=size(A);

    % finds the dimensions of A

    if centeringtoggle==1

    mu = A*ones(n,1)/n;

    A= A-mu*ones(1,n); % A is now the centered A matrix

    end

    [U,S,V]=fastsvds(A,k);

    % finds truncated SVD of A

    % later do smart implementation of svd on centered data

%using rank-one update rules.

    %% for Term Clustering

    [sortedU,index]=sort(U);

    %sort left singular vectors

    gapmatrix=sortedU(2:m,:)-sortedU(1:m-1,:);


    %each column of gapmatrix contains gaps of a left singular vector

    gapmeans=mean(gapmatrix,1);

    %find mean gap size for each vector

    stdgaps=std(gapmatrix,1,1);

    %find st. dev of gap size for each vector

    gapszscore=(gapmatrix-ones(m-1,1)*gapmeans)./(ones(m-1,1)*stdgaps);

    %convert all gaps to z-scores

    [row,col]=find(gapszscore>confidence);

    %find indices for all z-scores that are greater than

    %tolerance level for rows

    D=full(sparse(row,col,ones(length(row),1),m,k));

    %creates binary sparse matrix whose columns

    %contain ones to mark where large gaps in sing vectors occur

    C=zeros(m,k);

    for j=1:k

    count=0;

    for i=1:m

    C(i,j)=count+1; %creates cluster label matrix

    if D(i,j)==1

    count=count+1;

    %cluster label changes where large gaps occur

    end

    end

    end


    % matrix C is the matrix of cluster labels

    % need to sort C by index matrix

    [sortedindex,IIndex]=sort(index,1);

    for i=1:k

    C(:,i)=C((IIndex(:,i)),i);

    end

    %each column of cluster labels is now sorted

[b,i,h]=unique(C,'rows');

    %finds the rows of C with the same label patterns

    %these rows will be clustered together

    [termclusters,termclusterindex]=sort(h);

    %finds row reordering for A

    C(termclusterindex,:);

    %reorders the cluster label matrix

    %rows with same cluster label patterns will be adjacent

    numtrmclusters=size(b)

    %finds number of row clusters of A

    %% For Doc Clustering

    [sortedV,indexV]=sort(V);

    %sort right singular vectors

    gapmatrix=sortedV(2:n,:)-sortedV(1:n-1,:);

    %each column of gapmatrix contains gaps of a right singular vector

    gapmeans=mean(gapmatrix,1);


    %find mean gap size for each vector

    stdgaps=std(gapmatrix,1,1);

    %find st. dev of gap size for each vector

    gapszscore=(gapmatrix-ones(n-1,1)*gapmeans)./(ones(n-1,1)*stdgaps);

    %convert all gaps to z-scores

    [row,col]=find(gapszscore>confidence2);

    %find indices for all z-scores that are

    %greater than tolerance level for columns

    F=full(sparse(row,col,ones(length(row),1),n,k));

    %creates binary sparse matrix whose columns

    %contain ones to mark where large gaps in sing vectors occur

    E=zeros(n,k);

    for j=1:k

    count=0;

    for i=1:n

    E(i,j)=count+1; %creates cluster label matrix

    if F(i,j)==1

    count=count+1;

    %cluster label changes where large gaps occur

    end

    end

    end

    % matrix E is the matrix of cluster labels

    % need to sort E by index matrix


    [sortedindexV,IIndexV]=sort(indexV,1);

    for i=1:k

    E(:,i)=E((IIndexV(:,i)),i);

    end

    %each column of cluster labels is now sorted

[b,i,z]=unique(E,'rows');

%finds the rows of E with the same label patterns

    %these rows represent columns of A that will be clustered together

    [docclusters,docclusterindex]=sort(z);

    %finds row reordering for A

    E(docclusterindex,:);

    %reorders the cluster label matrix

    %rows with same cluster label patterns will be adjacent

    numdocclusters=size(b)

%finds number of column clusters of A

    if centeringtoggle==1

    A= A+mu*ones(1,n);

    end

    % A is now back to the original uncentered A matrix

    reorderedA=A(termclusterindex,docclusterindex);

    %reorders A into clustered form

    spy(reorderedA)

    %spyplot of reordered A


    doclabels=cellstr(doclabels);

    doclabels=doclabels(docclusterindex);

    %reorders column labels of A

    termlabels=cellstr(termlabels);

    termlabels=termlabels(termclusterindex);

    %reorders row labels of A

er=entropy2(reorderedA*reorderedA',h,m,numtrmclusters(1))

    %finds row entropy of reordered A

ec=entropy2(reorderedA'*reorderedA,z,n,numdocclusters(1))

    %finds column entropy of reordered A

    cd /Users/langvillea/David/vismatrix2

vismatrix2(reorderedA, termlabels, doclabels)

    cd /Users/langvillea/Desktop/datavisSAS-Meyer/vismatrix

    %sends reordered A to VISMATRIX tool

    A.3 Entropy Measure

    function [entropy]=entropy2(A,clusters,n,k)

    %input A is symmetric row by row or column by column
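
%clusters is the vector of cluster labels for the rows (or columns) of A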

    %n is the number of rows of A

    %k is the number of clusters of A

x=sort(clusters); %cluster labels in ascending order


[m,xstart]=unique(x,'first'); %find cluster start row

[m,xstop]=unique(x,'last'); %find cluster stop row

    spy(A)

    Q=zeros(n,k);

    for i=1:n

    for j=1:k

    maxvector(j)=nnz(A(i,xstart(j):xstop(j)))/((xstop(j)+1)-xstart(j));

    end

    [maxratio,index]=max(maxvector);

    Q(i,index(1))=1;

    end

    %forces row labels using max ratio vector

    P=zeros(k,k);

    for j=1:k

    for i=1:k

    P(j,i)=sum(Q(xstart(j):xstop(j),i))/((xstop(j)+1)-xstart(j));

    end

    end

    %creates a matrix of p(i,j)s

    P=(P.*log(P))/log(k);

    [row,col]=find(isnan(P));

    for i=1:length(row)


    P(row(i),col(i))=0;

    end

    %if p(i,j)=0 then set p(i,j)*log p(i,j)=0

    for i=1:k

    H(i)=-sum(P(i,:));

    alpha(i)=((xstop(i)+1)-xstart(i))/n;

    end

    %finds entropy measure for each cluster

    entropy=sum(alpha.*H)

    %finds entropy of entire partition

    A.4 Cluster Aggregation

    %%%%%%%%%%%%%%%%% ClusterAgg.m %%%%%%%%%%%%%%%%

    %% INPUT : L = n-by-p matrix of cluster results;

    %% column i contains results from clustering method i

    %% OUTPUT : A = n-by-n weighted undirected (symmetric)

    %% Aggregation matrix;

    %% A(i,j) = # of methods having items i and j in same cluster

    %% EXAMPLE INPUT: L=[1 3 1;3 1 2; 2 2 2 ; 1 3 1; 1 1 3; 2 2 2]

    % % L =

    %%

    %% 1 3 1


    %% 3 1 2

    %% 2 2 2

    %% 1 3 1

    %% 1 1 3

    %% 2 2 2

    %% means that clustering method 1 (info. in col. 1 of L) groups items

    %% 1, 4 and 5 together, then

%% items 3 and 6, and finally item 2 in its own cluster, creating

    %% a total of three clusters. Clustering method 2

    %% (info. in col. 2 of L)

    %% groups items 1 and 4, then 3 and 6, and finally 2 and 5. Notice

    %% that the cluster assignment labels used by one clustering

    %% method do not need to match those from another method. And

    %% the number of clusters found by each method do not need to

    %% match either. Yet all lists must be full, i.e., have the same

    %% number of items.

    function [] = ClusterAgg(L,labels);

    % n = # of items/documents

    % p = # of lists of cluster assignment results = # of methods

    [n,p]=size(L);

    A=zeros(n,n);

    % need to do p*(n choose 2) pairwise comparisons to create

    %Aggregation matrix A.

    for i=1:n


    for j=i+1:n

    matchcount=0;

    for k=1:p

    if L(i,k)==L(j,k)

    matchcount=matchcount+1;

    end

    end

    A(i,j)=matchcount;

    end

    end

A=A+A'; %makes the aggregation matrix symmetric

    %Now run any clustering method you like on Aggregation matrix A

    %For example, code for running the extended Fiedler method is below.

    %The Fiedler method is a good choice since the graph is undirected.

    % F = Fiedler matrix F=D-A

    D=diag(A*ones(n,1));

    F=D-A;

    % k = # of eigenvectors to use for extended Fiedler

    k=2;

[FiedlerVector,evalue]=eigs(F,k+1,'sa');

    FiedlerVector=FiedlerVector(:,2:k+1)

    U=(FiedlerVector>=0);


    x=zeros(n,1);

    for l=1:k;

    x=x+(2^(k-l))*(U(:,l));

    end

    % x contains the cluster assignments

    x

    % numclusters is the number of clusters produced by the aggregated

    %method

    numclusters=length(unique(x))

    %%%%%%%%%%%%%%%%%% ClusterAgg.m %%%%%%%%%%%%

    A.5 Other Code Used

    %%%%%%%%%%%BEGIN GLEICHS fastsvds.m%%%%%%%%%%%

    function [U, S, V] = fastsvds(varargin)

    % fastsvds performs the singular value decomposition more quickly than

% MATLAB's built-in function.

    %

    % [U S V] = fastsvds(A, k, sigma, opts)

    %

    % see svds for a description of the parameters.

    %


    A = varargin{1};

    [m,n] = size(A);

    p = min(m,n);

    if nargin < 2

    k = min(p,6);

    else

    k = varargin{2};

    end

    if nargin < 3

    bk = min(p,k);

    if isreal(A)

bsigma = 'LA';

    else

bsigma = 'LR';

    end

    else

    sigma = varargin{3};

    if sigma == 0 % compute a few extra eigenvalues to be safe

    bk = 2 * min(p,k);

    else

    bk = k;

    end

if strcmp(sigma,'L')

    if isreal(A)

bsigma = 'LA';


    else

bsigma = 'LR';

    end

elseif isa(sigma,'double')

    bsigma = sigma;

    if ~isreal(bsigma)

error('Sigma must be real');

    end

    else

error('Third argument must be a scalar or the string ''L''')

    end

    end

    if isreal(A)

    boptions.isreal = 1;

    else

    boptions.isreal = 0;

    end

    boptions.issym = 1;

if nargin < 4

% norm(B*W-W*D,1)/norm(B,1) <= tol/sqrt(2)  =>  norm(A*V-U*S,1)/norm(A,1) <= tol
boptions.tol = 1e-10 / sqrt(2);
boptions.disp = 0;

else


    options = varargin{4};

    if isstruct(options)

if isfield(options,'tol')

    boptions.tol = options.tol / sqrt(2);

    else

    boptions.tol = 1e-10 / sqrt(2);

    end

if isfield(options,'maxit')

    boptions.maxit = options.maxit;

    end

if isfield(options,'disp')

    boptions.disp = options.disp;

    else

    boptions.disp = 0;

    end

    else

error('Fourth argument must be a structure of options.')

    end

    end

    if (m > n)

    % this means we want to find the right singular vectors first

% [V D] = eigs(A'*A)

%f = inline('global AFASTSVDMATRIX;

%AFASTSVDMATRIX''*(AFASTSVDMATRIX*v)', 'v');

    [V D] = eigs(@multiply_mtm, n, bk, bsigma, boptions, A);

    [dummy, perm] = sort(-diag(D));


    S = diag(sqrt(diag(D(perm, perm))));

    V = V(:, perm);

    Sinv = diag(1./sqrt(diag(D)));

    U = (A*V)*Sinv;

    else

    % find the left singular vectors first

% [U D] = eigs(A*A')

%f = inline('global AFASTSVDMATRIX; A*(A''*v)', 'v');

    [U D] = eigs(@multiply_mmt, m, bk, bsigma, boptions, A);

    [dummy, perm] = sort(-diag(D));

    S = diag(sqrt(diag(D(perm, perm))));

    U = U(:, perm);

    Sinv = diag(1./sqrt(diag(D)));

V = Sinv*(U'*A);

V = V';

    end;

if nargout <= 1
U = diag(S); %when only one output is requested, return the singular values
end

function mtmv = multiply_mtm(v, A)
mtmv = A'*(A*v); %multiplies a vector by A'*A (used above when m > n)


    function mmtv = multiply_mmt(v, A)

mmtv = A*(A'*v);

    %global AFASTSVDMATRIX;

%mmtv = AFASTSVDMATRIX*(AFASTSVDMATRIX'*v);

    %%%%%%%%%%%%%END GLEICHS fastsvds.m%%%%%%%%%%

    %%%%%%%%%%readSMAT.m%%%%%%%%%%%%

    function A = readSMAT(filename)

    % readSMAT reads an indexed sparse matrix representation of

    % a matrix and creates a MATLAB sparse matrix.

    %

    % A = readSMAT(filename)

    % filename - the name of the SMAT file

    % A - the MATLAB sparse matrix

    %

    s = load(filename);

    m = s(1,1);

    n = s(1,2);

    ind_i = s(2:length(s),1)+1;

    ind_j = s(2:length(s),2)+1;

    val = s(2:length(s),3);

    A = sparse(ind_i,ind_j,val, m, n);

    %%%%%%%%%%end readSMAT.m%%%%%%%%%%


    Bibliography

[1] Charles J. Alpert, Andrew B. Kahng, and So-Zen Yao. Spectral partitioning: The

more eigenvectors, the better. In Proc. ACM/IEEE Design Automation Conf.,

pages 195-200, 1995.

    [2] Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using

    PageRank vectors. In Proceedings of the 47th Annual IEEE Symposium on Foun-

    dations of Computer Science, 2006.

[3] Francis R. Bach and Michael I. Jordan. Finding clusters in independent compo-

nent analysis. In 4th Intl. Symp. on Independent Component Analysis and

Signal Separation (ICA 2003), pages 891-896, 2003.

[4] Barbara E. Ball. Clustering directed graphs without symmetrization. Master's

Thesis, College of Charleston, 2006.

[5] Ibai E. Basabe. A new way to cluster data. Master's Thesis, College of Charleston,

2007.

[6] Michael W. Berry, Susan T. Dumais, and Gavin W. O'Brien. Using linear algebra

for intelligent information retrieval. SIAM Review, 37:573-595, 1995.


    [7] Michael W. Berry and Murray Browne. Understanding search engines: mathe-

    matical modeling and text retrieval. Society for Industrial and Applied Mathe-

    matics, Philadelphia, PA, USA, 1999.

    [8] Mike W. Berry and Murray Browne. Lecture Notes in Data Mining. World

    Scientific Publishing Co., 2006.

[9] Daniel Boley. Principal direction divisive partitioning. Data Mining and Knowl-

edge Discovery, 2:325-344, 1998.

    [10] Fan R.K. Chung. Spectral Graph Theory. Number 92. American Mathematical

    Society, 1997.

[11] David Gleich, Leonid Zhukov, and Matt Rasmussen. Vismatrix, 2006.

    [12] Brian S. Everitt, Sabine Landau, and Morven Leese. Cluster Analysis. Arnold

    Publishers, May 2001.

    [13] Miroslav Fiedler. Algebraic connectivity of graphs. Czech