
    CLUSTERING DATASETS WITH SINGULAR VALUE

    DECOMPOSITION

    A thesis submitted in partial fulfillment of the requirements for the

    degree

    MASTER OF SCIENCE

    in

    MATHEMATICS

    by

    EMMELINE P. DOUGLAS

    NOVEMBER 2008

    at

    THE GRADUATE SCHOOL OF THE COLLEGE OF CHARLESTON

    Approved by:

    Dr. Amy Langville, Thesis Advisor

    Dr. Ben Cox

    Dr. Katherine Johnston-Thom

    Dr. Martin Jones

    Dr. Amy T. McCandless, Dean of the Graduate School:


    2009

Copyright 2009 by Emmeline P. Douglas.

All rights reserved.


    ABSTRACT

    CLUSTERING DATASETS WITH SINGULAR VALUE

    DECOMPOSITION

    A thesis submitted in partial fulfillment of the requirements for the

    degree

    MASTER OF SCIENCE

    in

    MATHEMATICS

    by

    EMMELINE P. DOUGLAS

    NOVEMBER 2008

    at

    THE GRADUATE SCHOOL OF THE COLLEGE OF CHARLESTON

    Spectral graph partitioning has been widely acknowledged as a useful way to cluster

    matrices. Since eigen decompositions do not exist for rectangular matrices, it is

    necessary to find an alternative method for clustering rectangular datasets. The

    Singular Value Decomposition lends itself to two convenient and effective clustering

    techniques, one using the signs of singular vectors and the other using gaps in singular

    vectors. We can measure and compare the quality of our resultant clusters using an

entropy measure. When we are unable to decide which is better, the results can be nicely aggregated.


Contents

1 Introduction

2 The Fiedler Method
  2.1 Background
    2.1.1 Clustering with the Fiedler vector
    2.1.2 Limitations
  2.2 Extended Fiedler Method
    2.2.1 Clustering with Multiple Eigenvectors
    2.2.2 Limitations

3 Moving from Eigenvectors to Singular Vectors
  3.1 Small Example Datasets
  3.2 How to Cluster a matrix with SVD Signs
    3.2.1 Results on Small Yahoo! Dataset
    3.2.2 Why the SVD Signs method works
    3.2.3 Limitations of SVD signs
  3.3 SVD Gaps Method
    3.3.1 When are Gaps Large Enough?
    3.3.2 Results on Small Yahoo! dataset

4 Quality of Clusters
  4.1 Entropy Measure
  4.2 Comparing results from Small Yahoo! Example

5 Cluster Aggregation
  5.1 Results on Small Yahoo! Dataset

6 Experiments on Large datasets
  6.1 Yahoo!
  6.2 Wikipedia
  6.3 Netflix

7 Conclusion

A MATLAB Code
  A.1 SVD Signs
  A.2 SVD Gaps
  A.3 Entropy Measure
  A.4 Cluster Aggregation
  A.5 Other Code Used


    Acknowledgements

    I owe a great debt to the many people who helped me to successfully complete this

    thesis. First and most of all, to my thesis advisor Dr. Amy Langville who not only

    listened to countless research updates, presentations, and MATLAB complaints, but

    who encouraged me and gave me the confidence to start a thesis in the first place.

Second, to Kathryn Pedings, who was always a ready and willing sounding board whenever

    I was excited or frustrated and needed someone to talk to. Also I would like to thank

    my committee members for their interest, criticism, and encouragement. Last, but

not least, I would like to thank my fiancé, Andrew Aghapour, for being patient with

    me and my mood swings as I finished this thesis.


    Chapter 1

    Introduction

    Since the introduction of the internet to the public in the late 1980s, life has become

    much more convenient. Instead of making trips to the bank, the mall, the post

    office, and the grocery store, a person can manage their finances, pay their bills,

shop for gifts, and even buy groceries from the comfort of their own home. As fortunate
as consumers may feel, the internet has actually proven to be an even greater

    boon to the companies providing these conveniences. Now companies can easily

    gather information about their customers that they may not have known before,

    which, in the end, will help them make even more sales to even more customers.

    For example, the internet company Netflix invests a great deal of time and money

collecting data about their customers' movie preferences. They use this information

    to make recommendations to other customers. Better recommendations may result

    in more rentals and, more importantly, higher customer loyalty. However, moving

    from data collection to movie recommendation is not a trivial task, as the datasets

    inevitably grow quite large. One useful way to glean information from these massive

    datasets is to cluster them.

    Clustering is a data mining technique which reorganizes a dataset and places

    objects from the dataset into groups of similar items.1 When the dataset is rep-

1Note to the Reader: Clustering should not be confused with classification; classification names or qualifies the groups and clustering does not, though it might be used as a means to that end (i.e. datasets may be easier to classify once they have been clustered).


    resented as a matrix, clustering is essentially reordering the matrix so that similar

    rows and columns are near each other. Datasets can be created in hundreds of dif-

ferent structures and sizes, so it makes sense that the methods used to cluster

    them are just as abundant and varied. Hundreds of different clustering techniques

    have been developed by both mathematicians and computer scientists over the years;

these techniques can be grouped into two main categories: hierarchical and partitional

    [28].

Hierarchical: In hierarchical algorithms, clusters are created in a tree-like

    process by which the dataset is broken down into nested sets of clusters based on some

    measure of similarity between objects. An example diagram describing this process

    is shown in Figure 1.1.

Figure 1.1: Tree diagram for a hierarchical clustering algorithm. Any vertical cut would result in a clustering.

Hierarchical algorithms can be subdivided into two groups: the more widely used agglomerative (or bottom-up) methods, and the divisive (or top-down) methods.

    Linkage methods such as the Nearest Neighbors algorithm, which forms clusters by

    grouping objects that are nearest to each other, and the Centroid Method, which



    chooses central objects and then clusters the other objects according to their proximity

    to either centroid, are good examples of popular agglomerative clustering algorithms

    [8]. Divisive algorithms work in the opposite direction, starting with the full dataset

    as one cluster and then splitting it into smaller and smaller pieces. However, these

    techniques tend to be more computationally demanding, and, as mentioned before,

    are not as popular as the agglomerative methods. More examples of hierarchical

techniques, along with some discussion of their merits and disadvantages,

    are summarized by Everitt et al. in [12].

    Partitional: Partitional algorithms work by dividing the dataset into disjoint

    subsets. Principal Direction Divisive Partitioning, or PDDP, which divides a dataset

    into halves using the principal direction of variation of the dataset as described by

Boley in [9], falls into this category along with the other Singular Value Decomposition

    (SVD) based algorithms presented in Chapter 3. Spectral methods, or clustering

    algorithms that analyze components of the eigen decomposition, are also partitional

algorithms. One partitional technique that has gotten a lot of attention lately uses

    the PageRank vector to cluster data [2]. Independent Component Analysis, or ICA,

    analyzes and divides a dataset so that objects between clusters are independent, and

    objects within clusters are dependent [3], [17]. The k-means algorithm is a very

    popular partitional algorithm that has long been upheld as a standard in the field

of clustering due to its efficiency, flexibility, and robustness. The algorithm divides the

    data into k groups centered around the k cluster centers that must be chosen at the

    outset of the algorithm [19]. A nice comparison of many different algorithms, both

hierarchical and partitional, is presented by Halkidi et al. in [16].

I go into more detail about the history of spectral clustering in Chapter 2, since

    this particular field gave birth to the SVD clustering methods introduced in Chapter

    3; the first, SVD Signs, is an algorithm outlined by Dr. Carl Meyer in [22], and

the second, SVD Gaps, is my own SVD-based clustering algorithm. Of course, after


    introducing these two SVD clustering methods, some measure of cluster goodness

    is necessary in order to compare the two methods directly. Therefore, in Chapter 4

an entropy measure will be introduced that can be used to measure how well a

    dataset has been clustered. Many times, though, it is useful to be able to find an

    average clustering when one algorithm does not stand out above the rest, so Chapter

    5 introduces some ideas about cluster aggregation, a way to combine the results from

    several clustering algorithms, that will be helpful in cases where multiple algorithms

    produce good clusterings. Throughout these chapters small datasets will be used as

    examples to help clarify how the algorithms work, and what a well-clustered matrix

    looks like. However, as mentioned above, datasets tend to be very large in real life.

    Thus, it is crucial that clustering methods work as well on these large datasets as

    they do on the smaller ones. In Chapter 6 the results of my experiments with the

    SVD Signs and SVD Gaps algorithms on three large datasets (one from Yahoo!, one

    from Wikipedia, and one from Netflix) will be presented.

    The images in this paper have been created using several different computer

    programs. I used MATLAB for some of the simpler images such as line graphs, and

the Apple Grapher application to display small three-dimensional datasets. All of the
images of large matrices were created with David Gleich's VISMATRIX tool [11].


    Chapter 2

    The Fiedler Method

Though graph theoretic clustering has been used heavily by computer scientists in

    recent decades, the understanding of these methods is rather new. The mathematics

    behind these methods were not explored until the late 1960s and early 1970s. In 1968,

Anderson and Morley published their paper [18] on the eigenvalues of the Laplacian

matrix, a special matrix in graph theory that will be defined in Section 2.1. Then

    in 1973 and 1975, Miroslav Fiedler published his landmark papers [14] and [13] on

the properties of the eigensystems of the Laplacian matrix. Fiedler's ideas were not

    applied to the field of clustering until Pothen, Simon, and Liou did so with their

1990 paper [24]. These papers are the origins of spectral graph partitioning, a sub-

    field of clustering that uses the spectral or eigen properties of a matrix to identify

    clusters. There are two methods in the spectral category that inspired the SVD-based

    clustering methods introduced in Chapter 3: the Fiedler Method and the Extended

    Fiedler method.

    2.1 Background

    The Fiedler Method takes its name from Miroslav Fiedler because of two important

    papers he published in 1973 and 1975 that explored the properties of eigensystems


    of the Laplacian Matrix. What is the Laplacian matrix? Consider this small graph

    with 10 vertices or nodes that are connected by several edges.

    Figure 2.1: Small graph with 10 nodes and its adjacency matrix.

    We can easily represent this graph with a binary adjacency matrix, where the

    rows and columns represent the 10 nodes, and non-zero entries represent the edges

    between nodes. Any graph, small or large, can be fully represented by a matrix. Once

    we have an adjacency matrix, the corresponding Laplacian matrix L can be found by

L = D - A,    (2.1)

    where A is the adjacency matrix, and D is a diagonal matrix containing the row sums

    of A. Figure 2.2 shows the Laplacian matrix for the 10 node graph given above.
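
As a concrete illustration, this construction takes only a few lines of MATLAB; the 3-node adjacency matrix below is a hypothetical stand-in, not the 10 node graph of Figure 2.1.

    % Build the Laplacian L = D - A of Equation (2.1).
    % A is assumed to be a symmetric binary adjacency matrix (hypothetical example).
    A = [0 1 1;
         1 0 1;
         1 1 0];
    D = diag(sum(A, 2));   % diagonal matrix of row sums (vertex degrees)
    L = D - A;             % the Laplacian matrix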

    Figure 2.2: Finding the Laplacian matrix for the adjacency matrix in Figure 2.1

    These matrices can be used to discover important properties of the graphs they


    represent. Most importantly, a Laplacian matrix can give us information about the

    connectivity of the graph it represents. In fact, in his papers [14] and [13], Miroslav

    Fiedler proved that the eigenvector corresponding to the second smallest eigenvalue,

    which is now called the Fiedler vector, can tell us how a graph can be broken down into

    maximally intraconnected components and minimally interconnected components. In

    other words, the Fiedler vector is a very useful tool for partitioning the graph. A more

in-depth look at spectral graph theory can be found in Chung's Spectral Graph Theory

    [10]. Some more important results about spectral partitioning are shown in [26] and

    [20].

    2.1.1 Clustering with the Fiedler vector

    Suppose we have the graph from Figure 2.1 with its corresponding Laplacian matrix

    (see Figure 2.3).

    Figure 2.3: Graph with 10 nodes and its Laplacian matrix.

From the Laplacian matrix we obtain the eigen decomposition. Figure 2.4 shows the eigenvectors and eigenvalues from the Laplacian matrix in Figure 2.3. Notice that the smallest eigenvalue is 0, and its corresponding eigenvector is a scalar multiple of the vector of all ones, as is the case with all Laplacian matrices. The eigen-

    vector we are interested in is the Fiedler vector, which is circled. We can use the


    signs of this eigenvector to cluster our graph. This clustering method is known as the

    Fiedler Method.

Figure 2.4: Eigenvectors and eigenvalues of L, with the second smallest eigenvalue and the Fiedler vector circled.

    The rows with the same sign are placed in the same cluster. Therefore, for

    the 10 node example, nodes 1, 2, 3, 7, 8, and 9 are in one cluster while nodes 4, 5, 6,

    and 10 are in another cluster. Looking at Figure 2.5, we can see this partition makes

    a lot of sense in the context of the graph. As expected, the Fiedler Method cut the

    graph into two better connected subgraphs.
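
A minimal MATLAB sketch of one such sign-based split is given below, assuming L is the Laplacian from Section 2.1; the variable names are mine, not code from the thesis appendix.

    % One iteration of the Fiedler Method: split the nodes by the sign of the
    % Fiedler vector (eigenvector of the second smallest Laplacian eigenvalue).
    [V, E] = eig(full(L));             % eigen decomposition of the Laplacian
    [~, order] = sort(diag(E));        % eigenvalues in ascending order
    fiedler = V(:, order(2));          % the Fiedler vector
    cluster1 = find(fiedler >= 0);     % nodes with nonnegative entries
    cluster2 = find(fiedler < 0);      % nodes with negative entries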

Figure 2.5: Signs of v2, the Fiedler vector, and the partition made by the first iteration of the Fiedler Method.

    The next step is to take each subgraph and partition each with its own Fiedler

vector. We will only do this with the larger subgraph, since the cluster containing

    nodes 4, 5, 6, and 10 is fully connected and so it does not make sense to cluster it


    further. However, as we can see in Figure 2.6, the second iteration of the Fiedler

    Method works very nicely on the second half of the graph.

    Figure 2.6: Partition made by the second iteration of the Fiedler Method.

    For this small graph, two iterations are sufficient for a satisfactory clustering.

As graphs become larger, however, many more iterations are necessary. The

    algorithm stops when no more partitions can be made such that the number of edges

    between two clusters is less than the minimum number of edges within either of

    the two clusters. Barbara Ball presents a much more thorough explanation of this

    algorithm in [4].

    The Fiedler Method has been shown to perform very well in experimentation.

    Some experimental results are given in [31], [4], and [15].

    2.1.2 Limitations

    Though this method is theoretically sound, and has been shown to work very nicely on

    large as well as small square symmetric matrices, it does have drawbacks. First, the

    Fiedler Method is iterative. Therefore, if at any point a questionable partition is made,

    the mistake is exacerbated by further iterations. Also, new eigen decompositions must

    be found at every iteration, which can be expensive for larger datasets.

    Secondly, the Fiedler Method only works for square symmetric matrices. Many


    different symmetrization techniques have been developed for non-square or non-

    symmetric matrices, but inevitably some information contained in the matrix is lost

    whenever symmetry is forced.

    Though it is still based on the eigen decomposition, the next clustering algo-

    rithm does not carry the drawbacks of an iterative procedure.

    2.2 Extended Fiedler Method

    In the last section we considered the application of the Fiedler vector to the problem of

    clustering. Surely the Fiedler vector is not the only eigenvector that can be of service.

    In fact, Extended Fiedler finds much success by incorporating multiple eigenvectors.

    2.2.1 Clustering with Multiple Eigenvectors

    Since the rise of the Fiedler Method, many mathematicians have developed a similar

    clustering algorithm, referred to as the Extended Fiedler method, that uses multiple

    eigenvectors (see references [1] and [5] for such algorithms).

    The Extended Fiedler method follows the same preliminary steps as the Fiedler

    Method, but diverges when it comes to the actual clustering. Instead of looking at

    the signs of one eigenvector, Extended Fiedler looks at the sign patterns of multiple

    eigenvectors.

    Algorithm 1 Extended Fiedler Let L be the Laplacian matrix for a symmetric

    matrix A.

    1. Find Vk, a matrix containing the first k eigenvectors of L and Ek, a diagonal

    matrix containing the first k eigenvalues of L in ascending order such that Vi is

    an eigenvector with the ith eigenvalue in Ek.

2. Look at the signs of columns 2 through k of Vk.


3. If rows i and j have the same sign pattern, then rows i and j of A belong in the

    same cluster.

    The algorithm extends to as many eigenvectors as the user deems necessary.

If k vectors are used, then up to 2^k (but often fewer) clusters result.
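
The sign-pattern grouping of Algorithm 1 can be sketched in MATLAB as follows; this is only an illustration under the assumption that L is the Laplacian of a symmetric matrix A, not the thesis appendix code.

    % Extended Fiedler: group rows by the sign patterns of eigenvectors 2..k of L.
    k = 3;                                      % number of eigenvectors to use
    [V, E] = eig(full(L));
    [~, order] = sort(diag(E));                 % eigenvalues in ascending order
    signPatterns = sign(V(:, order(2:k)));      % signs of eigenvectors 2 through k
    [~, ~, clusters] = unique(signPatterns, 'rows');   % equal patterns share a cluster
    % clusters(i) is the cluster number assigned to node i of the graph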

    To help explain Extended Fiedler, I will demonstrate how it clusters the 10

    node graph from Section 2.1 when k = 2 eigenvectors are used. Note first that the

algorithm finds only 3 clusters (see Figure 2.7), which is fewer than the potential of
4 clusters. Second, notice that for this example the Extended Fiedler method and

    the Fiedler Method produced the exact same clustering of this dataset. Also, I only

    needed to calculate one eigen decomposition. Some experimental results with this

    algorithm are detailed by Basabe in [5].

Figure 2.7: Results of Extended Fiedler on the 10 node graph when using the signs of 2 eigenvectors.

    2.2.2 Limitations

    Though Extended Fiedler frees us from the iterative processes of the Fiedler Method,

    we are still bound to and limited by the eigen decomposition which only exists for

    square matrices, and only has real-valued eigenvalues and eigenvectors when the ma-

    trix is symmetric. The next Chapter moves on to the related but more flexible singular


    value decomposition, and the two SVD-based clustering methods, which do not have

as many limitations as the two Fiedler algorithms.


    Chapter 3

    Moving from Eigenvectors to

    Singular Vectors

    It would be nice to find a clustering method as simple and robust as the extended

    Fiedler method but that is more flexible and extends to rectangular matrices. Ob-

    viously, the main obstacle in our way is the decomposition being used. Is there a

    decomposition for both rectangular and square matrices that has the same structure

    as the eigen decomposition?

    As it turns out, the Singular Value Decomposition accomplishes this goal. A

unique SVD is defined for a matrix of any size, and the SVD of a square matrix is related

    to its eigen decomposition. SVD is not as widely known or studied as the eigen

    decomposition, so we will define it now.

Definition 1 [21] The singular value decomposition of an m × n matrix A with rank r is an orthogonal decomposition of the matrix into three matrices such that

A_{m×n} = U_{m×r} S_{r×r} V^T_{r×n}.    (3.1)


    Figure 3.1: A Diagram for the Singular Value Decomposition of a matrix

    This decomposition is called orthogonal since the columns of U are orthogonal

    to each other. The same holds for the rows of VT. The matrix S is a diagonal matrix

    that contains the singular values of A in descending order. Singular values are always

    non-negative real numbers [21]. In Section 3.2, the three components of the SVD, U,

    S, and VT, will be addressed in greater detail.

    As mentioned above, the eigen decomposition and the singular value decom-

position are closely connected. Suppose B and C are the square symmetric matrices B = AA^T and C = A^T A for some rectangular matrix A with singular values s_i, left singular vectors u_i, and right singular vectors v_i. Then

B u_i = s_i^2 u_i,    (3.2)

and

C v_i = s_i^2 v_i.    (3.3)

Therefore, u_i is an eigenvector of B with eigenvalue s_i^2, and v_i is an eigenvector of C with eigenvalue s_i^2 [28]. Hence, if A is a square symmetric matrix, then the eigen decomposition of A and the singular value decomposition of A are equivalent.
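
The relationship in (3.2) is easy to verify numerically; the short MATLAB check below uses a random rectangular matrix and is only a sanity check of my own, not part of the thesis.

    % Check that a left singular vector of A is an eigenvector of B = A*A'
    % with eigenvalue equal to the squared singular value.
    A = rand(6, 4);                             % a hypothetical rectangular matrix
    [U, S, V] = svd(A, 'econ');
    B = A * A';
    err = norm(B*U(:,1) - S(1,1)^2 * U(:,1));   % should be near machine precision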

    The Singular Value Decomposition is an exact decomposition. In other words,

    we can multiply U, S, and VT and get back the original matrix A. However, we can


    also use SVD to find an approximation of A of rank k, where k < r, by multiplying

    only the first k columns of U, the first k values in S, and the first k rows of VT, as

    in the diagram below. This is called the truncated SVD of a matrix.
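
In MATLAB the truncated SVD can be computed directly with svds; the sketch below shows the rank-k approximation described above, assuming A is the matrix of interest and k = 4 is an arbitrary illustrative choice.

    % Truncated SVD: keep the k dominant singular triplets and form the best
    % rank-k approximation Ak = Uk * Sk * Vk'.
    k = 4;                             % illustrative choice of rank
    [Uk, Sk, Vk] = svds(A, k);         % k largest singular values and vectors
    Ak = Uk * Sk * Vk';                % rank-k approximation of A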

    Figure 3.2: A Diagram for the truncated Singular Value Decomposition of a matrix

    This truncated SVD not only gives us a rank k approximation to a matrix A,

    it gives us the best possible rank k approximation in the following sense:

Theorem 1 (Eckart and Young; see [21]) Let A be a matrix of rank r, let A_k be the SVD rank-k approximation to A with k ≤ r, and let B be any other matrix of rank k. Then

||A - A_k||_F ≤ ||A - B||_F.    (3.4)

When k < r, a natural question is how to choose k. One common approach is to plot the singular values and look at where the magnitude of these values drops, forming a sort of elbow on the line graph. If the elbow occurs at the jth singular value, we might set k = j.
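
This elbow rule can be roughed out in MATLAB as below; choosing the largest drop between consecutive singular values is my own crude automation of the visual heuristic, not a rule from the thesis.

    % Crude elbow heuristic: choose k where the largest drop between
    % consecutive singular values occurs.
    s = svds(A, 30);                   % leading singular values (assumes A has at least 30)
    [~, k] = max(-diff(s));            % largest drop follows the k-th singular value
    plot(s, '-o'); xlabel('index'); ylabel('singular value magnitude');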

    Figure 3.3: A line plot of the singular values of a matrix where the y-axis represents

the magnitude of a singular value. Since the graph drops sharply at the fourth singular value, it is reasonable to set k = 4.

    Because of these properties, the singular value decomposition has many appli-

    cations outside the field of clustering as well. Mike Berry and others established the

    usefulness of SVD with respect to information retrieval in [6] and [7], and Skillicorn

    has applied it to counterterrorism in [27].

    3.1 Small Example Datasets

    The next two sections cover the algorithms for my two SVD-based clustering methods.

    In these sections, it will be useful to have a small example matrix to demonstrate how

    well each method clusters and reorders the given matrix. For this purpose, a small

    subset of 45 rows, or search phrases, and 24 columns, or advertisers, was chosen from

    the large Yahoo! dataset (the full Yahoo! matrix is introduced and used in Chapter

    6). The rows and columns were selected so that the small matrix has the nice block-

diagonal structure associated with well-clustered matrices; the matrix was then randomized

    by means of a random permutation. This way we have an answer key with which to

    compare our results.


Figure 3.4: Small Yahoo! example matrix, along with the terms represented by each row, before (left) and after (right) randomization.

Notice there are row labels for the dataset, but no column labels. Yahoo! was willing to release the terms represented by the dataset, but kept the advertisers' names secret for business and privacy reasons. This does not present a great obstacle, since we can still get a very good idea of how a clustering algorithm performs based on

    how the rows have been reordered.

    I will also use two other small matrices that each represent a different set of

    points shown in Figures 3.5 and 3.6 in three-dimensional space. These will be helpful

    visuals when discussing the geometric aspects of each clustering algorithm.

    3.2 How to Cluster a matrix with SVD Signs

The first SVD-based clustering method to be discussed is the SVD Signs method, which uses the sign patterns of the singular vectors rather than the eigenvectors used by the Extended Fiedler method.


    Figure 3.5: 11 by 3 matrix as a set of eleven points in three-dimensional space.

    Figure 3.6: Another set of points used in this chapter.

As discussed earlier, eigenvectors hold a lot of information about a graph's

    connectivity, and this information is exploited by the Fiedler clustering methods.

    Since the eigen decomposition and singular value decomposition are so closely related,

    it is not surprising that the singular vectors also carry a wealth of information about

    the matrices they represent, and so play a central role in both SVD Signs and SVD

    Gaps methods of clustering.

    For both methods, the first step is to find the truncated SVD of the matrix.

This will give us three matrices U_{m×k}, S_{k×k}, and V^T_{k×n} for some chosen k, where the columns of U are the k dominant left singular vectors of A (the k vectors that contribute most to the dataset), the entries of S are the k dominant singular values of A, and the rows of V^T contain the k dominant right singular vectors.

    Once the truncated SVD is obtained, the SVD signs algorithm uses the sign

    patterns of the singular vectors to group the rows and columns in precisely the same

    way as the sign patterns of eigenvectors were used in the extended Fiedler method.

    Rows that have the same sign pattern in the first k singular vectors are grouped

    together.

    For example, Figure 3.7, below, shows the first two left singular vectors from

    some matrix A. Using the sign patterns from these two vectors, rows 1 and 4 would

    be clustered together, rows 2, 5, 6, and 7 would be clustered together, and row 3

would be placed in a cluster by itself. Note that if we use k singular vectors, we can
have up to 2^k clusters, since each row of Uk has k entries, each with 2 possible values.

    Luckily, the algorithm rarely yields such a high number of clusters.

    Figure 3.7: Clustering by using sign patterns of the first two singular vectors

    Why were the left singular vectors used here instead of the right ones? Recall

that earlier this was not an issue because we used the eigen decomposition, which
has only one set of eigenvectors. On the other hand, the SVD gives two sets of

    singular vectors. Which set of singular vectors, the left or the right, should be used?

    Note also that the spectral methods only dealt with square symmetric matrices, and

    so one reordering could be applied to both rows and columns. This is not the case for

    SVD signs, which can be used on rectangular matrices, as well as asymmetric square


    matrices, and so calls for two independent re-orderings. This is where the two sets of

singular vectors come in handy: the signs of the left singular vectors, or the

    columns of U, give a clustering for the rows, while the signs of the right

    singular vectors, or the columns of V, can be used to cluster the columns.

    Algorithm 2 SVD Signs

    1. Find [Uk, Sk, VTk ] = svds(A, k)

2. If rows i and j of Uk have the same sign pattern, then rows i and j of A are in the same cluster.

3. If columns i and j of VTk have the same sign pattern, then columns i and j of A are in the same cluster.
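
A compact MATLAB sketch of Algorithm 2 is given below; the code in Appendix A.1 may differ in details (for example, how zero entries are treated), so this is only an illustration assuming A is the matrix to be clustered.

    % SVD Signs: cluster rows and columns of A by the sign patterns of the
    % k dominant left and right singular vectors.
    k = 3;
    [Uk, Sk, Vk] = svds(A, k);
    [~, ~, rowClusters] = unique(sign(Uk), 'rows');   % rows with equal sign patterns
    [~, ~, colClusters] = unique(sign(Vk), 'rows');   % columns with equal sign patterns
    [~, rowOrder] = sort(rowClusters);                % group rows cluster by cluster
    [~, colOrder] = sort(colClusters);
    Areordered = A(rowOrder, colOrder);               % reordered (clustered) matrix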

    3.2.1 Results on Small Yahoo! Dataset

    The SVD signs method performs very well on the small Yahoo! example, as can

    be seen in Figure 3.8. First, the reordered matrix has the very nice block diagonal

    structure that is characteristic of well-clustered matrices. Second, we can see that,

    with one exception, all the terms were returned to their original categories. Note that

    we chose k = 3 here. It turns out that using three left singular vectors gives the best

clustering of the rows, even though the singular values (shown in Figure 3.8) suggest

    a k value of 4. This is a good example of how problematic choosing a k value can be.

    The next obvious question one might ask is: why does this work? Since SVD is

    an orthogonal decomposition, it has a very nice geometry. In fact, it is the geometrical

    properties of the SVD, which are explained and proved by Meyer in [22], that give a

    clear explanation for why this SVD Signs method works so well.


    Figure 3.8: The singular values of the Small Yahoo! dataset and the results of using

the SVD Signs method with k = 3 on the dataset.

    3.2.2 Why the SVD Signs method works

Note that any m × n matrix can be thought of as a set of m points in an n-dimensional
space. For example, Figure 3.9 shows an 11 × 3 matrix and the corresponding cloud

    of 11 points.

    Figure 3.9: 11 by 3 matrix as a set of eleven points in three-dimensional space.

    In this geometrical context, the right singular vectors represent the vectors

of principal trend of the data cloud. In other words, the first right singular vector

    (referred to from now on as v1) will point in the direction of highest variation in the

    data cloud. However, when we plot the data cloud and its first right singular vector


    together, as in Figure 3.10, it certainly does not look like the vector is pointing in the

    direction with the most variation.

    Figure 3.10: First right singular vector

    This is because the dataset has not been centered, and v1 looks for the direction

    of highest variation from the origin. After centering the dataset, as in Figure 3.11, v1

does point in the direction of principal trend.

    Figure 3.11: First right singular vector of centered dataset

    The second right singular vector, v2, points in the direction of secondary trend

    orthogonal to v1, and v3 points in the direction of tertiary trend orthogonal to both

    v1 and v2.

    Not only do these three right singular vectors represent the three directions of

    principal trend in the dataset, but they also represent a new set of axes for our set of


    Figure 3.12: The three right singular vectors of the dataset

    points!

    Figure 3.13: The right singular vectors can be thought of as a new set of axes for thedataset.

    In this particular example, the dimension of the original dataset and the di-

    mension of the new space created by the right singular vectors are the same because

    all of the right singular vectors were used. What happens when the truncated SVD,

    rather than the full SVD is used? In other words, what if instead of using all n right

    singular vectors, we decide to use only k of them? In this case, the original dataset

    of dimension n is projected into a space of lower dimension k. Why would anyone

    want to do this? Wouldnt a lot of the information in the original dataset be lost?

    Of course, as with any projection of this nature, some of the information will be lost.

    But the information lost will be the least important, and sometimes even superfluous,


    information that could be obfuscating important correlations in the original matrix.

    This is because the Singular Value Decomposition naturally sorts trends in the matrix

    from most important to least important [28]. Therefore, in most cases, losing extra

    dimensions does not pose a problem, and can even be helpful.

    Now let us consider the left singular vectors. What is their role? It was shown

    by Meyer [22] that the left singular vectors give the coordinates of the points on the

    new set of axes created by the right singular vectors! In other words, u1 contains the

    orthogonal projections of each point onto v1.

Figure 3.14: The left singular vectors contain the coordinates of the points when projected onto each right singular vector.

    This information about the geometry of SVD can now be used to better un-

    derstand the SVD Signs clustering method discussed above. When the signs of u1

    are used to divide the cloud of points into two pieces, all the points projected onto

the positive half of v1 are placed in one cluster, and all the points projected onto the negative half of v1 are placed in another cluster. This is essentially the same as slicing

    through the centered set of points at the origin with a hyperplane that is orthogonal

    to v1. Figure 3.15 demonstrates this with the set of 11 points, and shows the two

    clusters that result.


Figure 3.15: Points on the positive side of the hyper-plane are in one cluster, while points on the negative side are in another (left). The dataset is now divided into two clusters (right).

These steps are repeated with the signs of u2, resulting in another hyper-plane

    orthogonal to the first that divides the set of points with respect to v2. As shown

    in Figure 3.16, these two planes divide the space into quadrants, with each quadrant

    containing a different sign pattern and therefore a different cluster.

    Figure 3.16: Points are further clustered according to their quadrant.

    Next, the set of points is divided with respect to the signs of u3, resulting in a

    third hyper-plane that divides the space into octants. Note that not all of the octants

happen to contain points, and so fewer than eight, or 2^3, clusters result.

    It is possible to go too far with the method and divide the set of points into

    too many groups. Notice that, while the first two singular vectors resulted in very


    Figure 3.17: Points further clustered according to their octant.

    intuitive clusterings, some of the clusters resulting from the third vector, particularly

the clusters circled in Figure 3.18, are questionable. This represents the consequence

    of moving from k = 2 to k = 3 when clustering this set of points. Is it better to

    have a clustering that is too fine, or not fine enough? There is really no good answer

    to this question. The most appropriate k depends on the goals of and applications

    envisioned by the researcher.

    Figure 3.18: Cost of choosing a k that is too high.

    3.2.3 Limitations of SVD signs

    Like the Fiedler and Extended Fiedler methods before it, SVD signs clusters strictly

    according to positive and negative signs. It breaks the dataset into halves according


    to whether the projection of each point lies on the positive or negative half of a vector.

But what if the data set doesn't naturally break in half? Or if the break point lies
somewhere other than the middle of the dataset? For example, what if SVD Signs

    were applied to the following trimodal data set introduced earlier:

    Figure 3.19: Example of a trimodal dataset

Since the SVD Signs method can only bisect a dataset at any given iteration,

    it would divide this set of points into two halves, cutting right through the middle

    clump of points as shown in Figure 3.20.

    Figure 3.20: Signs method breaks dataset in half, rather than into thirds as desired.

    Clearly, dividing this set of points into three clusters by splitting it at two

    places along the direction of principal trend would be preferable, but SVD Signs


    simply does not have that capability. It was with this flaw in mind that we set out

    to create an SVD clustering method that can tailor itself to the shape of a dataset.

    3.3 SVD Gaps Method

    What if, instead of using the signs of the left singular vectors to blindly cut through

    a dataset at its center, we considered the gaps in the left singular vectors instead?

    Remember that these vectors contain orthogonal projections of the points in the

    directions of principal, secondary, and tertiary trend. Therefore, we can use them to

    find the gaps between points in any of these directions and divide the dataset where

    the gaps occur.

    Figure 3.21: Gaps preserved when a dataset is projected onto a singular vector.

    For example, when we look at the first left singular vector of the tri-modal

    dataset introduced earlier, the large gaps can be found quite easily. As seen in Figure

    3.22, this new gaps method would cut through the center of these gaps and therefore

    create the three clusters we desired earlier.

    The algorithm for SVD Gaps follows many of the same steps as the Signs

    method, but introduces a few important changes. As with the Signs method, the


Figure 3.22: First singular vector of the trimodal dataset with the gaps between entries, and the resulting cuts.

    truncated SVD, using the appropriate rank k, must be found, and then the gaps

    between entries of the left singular vectors are calculated. If a gap between two entries

    is large enough, a division is placed between the corresponding rows of the original

matrix. Again, clustering the columns of a matrix is similar; the only difference is
that the algorithm uses gaps in the right singular vectors to determine where

    to divide the columns.

    Algorithm 3 SVD Gaps

    1. Find [Uk, Sk, Vk] = svds(A, k)

2. For 1 ≤ i ≤ k, sort Ui (or Vi if clustering columns) and find the gaps between

    entries.

    3. If the gap between rows j and j + 1 of Ui (Vi) is large enough then divide A

    between the corresponding rows (columns).

    4. Create a column vector Ci that contains numerical cluster labels for Ui (Vi) for

    all rows (columns).

5. After finding Ci for all 1 ≤ i ≤ k, compare the cluster label patterns for the rows of C.


    6. If rows (columns) i and j have the same cluster label pattern in C, then rows

    (columns) i and j belong in the same cluster.
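
The row-clustering half of Algorithm 3 can be sketched in MATLAB as follows; the gap threshold anticipates the mean-plus-standard-deviation rule of Section 3.3.1, and the thesis code in Appendix A.2 may differ in details.

    % SVD Gaps (rows): cut each sorted left singular vector at its large gaps,
    % then cluster rows by their pattern of segment labels.
    k = 3;  tol = 2.15;                              % rank and standard-deviation cut-off
    [Uk, Sk, Vk] = svds(A, k);
    m = size(A, 1);
    C = zeros(m, k);                                 % segment labels, one column per vector
    for i = 1:k
        [vals, order] = sort(Uk(:, i));              % sort the entries of the i-th vector
        gaps = diff(vals);                           % gaps between consecutive entries
        isCut = gaps > mean(gaps) + tol*std(gaps);   % gaps that are "large enough"
        segments = cumsum([1; isCut]);               % segment number of each sorted entry
        C(order, i) = segments;                      % map back to the original row order
    end
    [~, ~, rowClusters] = unique(C, 'rows');         % same label pattern, same cluster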

    3.3.1 When are Gaps Large Enough?

    Earlier, the issue of when gaps are large enough was glossed over. However, this is

    a very important component of the Gaps algorithm. In fact, the effectiveness of the

    entire method rests on deciding when a gap in the data is large enough to be used as

    a cut. If the criteria are too relaxed, there will be far too many clusters. However, if

    the criteria are too stringent the algorithm might not discover as many clusters as it

should.

So how big is big enough? Obviously, the measure should be relative to the

data. For example, some datasets might have smaller gaps overall, and so have

    smaller significant gaps, than other datasets. Therefore, the significance of a gap

    should be tied to the average size of the gaps in the data. However, the points in

    a dataset can be greatly spread out in the direction of v1, but more compact in the

    direction of v2. In this case, it would not be a good idea to pool all the gaps from

    all the singular vectors together, as this would result in too many cuts in the earlier

    vectors and too few cuts in the later vectors. It makes sense, then, for an average to

    be taken with respect to each individual singular vector, which is exactly what the

    SVD Gaps algorithm does.

    Now that we have an average gap size for each singular vector, we know we

    should only choose gaps that are larger than the average gap, but how much larger

    should it be? It would make a lot of sense if, before making this decision, the algorithm

    took into account how spread out the gap sizes were. Therefore, a standard deviation

    for the gaps of each singular vector should also be found. From there, it is easy to

    calculate how many standard deviations away each gap is from the average gap (this

    might bring to mind the z-score of the normal distribution, but we must remember


    that these gaps are most likely not normally distributed). Any gap that is more than

    a certain number of standard deviations larger than the average gap will be chosen as

a place to cut the dataset. We can use Chebyshev's result about general distributions

    to help us decide the cut-off or tolerance level for the number of standard deviations;

    experimentation with various datasets has shown that using gaps that are 1.5 to

    2.5 standard deviations larger than the average yields nice clusters. The number of

    standard deviations is a parameter of the SVD Gaps method, and can be set by the

    user.

    3.3.2 Results on Small Yahoo! dataset

    SVD Gaps also performs fairly well on the small Yahoo! dataset (see Figure 3.23

    below). I chose a value of 3 for k, again, and a tolerance level (i.e. the standard

    deviation cut-off for the gaps) of 2.15.

Figure 3.23: Small Yahoo! dataset clustered with the SVD Gaps algorithm using k = 3 and tol = 2.15.

    The reordered picture for SVD Gaps does not look quite as nice as the one

    for SVD Signs, but this does not necessarily mean that the clustering is not as good.

    However, by looking at the reordered list of terms, we can see that all of the terms


    are returned to their original group. Are the clusterings equal in strength, since they

    got essentially the same term reordering? It is hard to decide based simply on the

    appearance of the reordered matrix and reordered list of terms. The next chapter

    introduces a more rigorous way to compare the performances of the two algorithms.


    Chapter 4

    Quality of Clusters

    With small examples, such as the tri-modal set of points and the small Yahoo! dataset,

    determining whether a clustering algorithm works well is easy and can be done visually

    (especially when the dataset was created with specific clusters in mind, as was the case

    with both of these). If we want to be able to apply either of these algorithms to real

    world problems, however, a more rigorous measure of the quality of the clustering,

    or cluster goodness, must be found. Such a measure allows one to decide which

algorithm is better: SVD Signs or SVD Gaps. With these goals in mind, let's take a

    look at an entropy measure.

    4.1 Entropy Measure

    The entropy measure used in this paper for clustering is based on the measure pre-

    sented by Meyer in [23] and revolves around the concept of surprise, or the surprise

    felt when an event occurs.

Definition 2 [23] For an event E such that 0 < P(E) = p ≤ 1, the surprise S(p)

    elicited by the occurrence of E is defined by the following four axioms.

    1. S(1) = 0


    2. S(p) is continuous with p

3. p < q implies S(p) > S(q)

    4. S(pq) = S(p) + S(q)

    Basically, events with lower probabilities elicit a higher surprise when they

    occur, and vice versa; the function for surprise in terms of the probability turns out

    to be

S(p) = -log p,    (4.1)

which satisfies S(1) = 0 [23]. Now, if X is a random variable, then the entropy of X is the expected surprise of X.

Definition 3 For a discrete random variable X whose distribution vector is p, where p_i = P(X = x_i), the entropy of X is defined as

E[S(X)] = H_X = -∑_{i=1}^{n} p_i log p_i.    (4.2)

Set t log t = 0 when t = 0.

    Before we can apply this measure to clustering we must resolve a problem

    with the nature of rectangular datasets. One issue that makes these datasets difficult

to work with is that the rows must be clustered and reordered independently of

    the columns. Therefore the reordered matrix does not often have the nice, clean,

    block diagonal structure that we get from symmetric reorderings like the Fiedler or

    Extended Fiedler methods (see Figure 4.1).

    So, even if a matrix has been clustered and reordered well, it can be hard to

    tell by looking at the reordered matrix. However, if a matrix A is clustered well and

reordered to Ā, then R = ĀĀ^T, or the reordered row by reordered row matrix, and


    Figure 4.1: This matrix has very well defined clusters, even though the matrix is not

    in block diagonal form.

C = Ā^TĀ, or the reordered column by reordered column matrix, both have nice block

    diagonal structures with few nonzero entries outside of the blocks. For R these blocks

    represent the row clusters, and for C they represent the column clusters. Figure 4.2

    shows a perfectly block diagonal matrix, as well as one that has a few stray points. If

    these matrices had been reordered by a clustering algorithm, we would say that the

    one on the left had been clustered better than the one on the right.

    Figure 4.2: Two block diagonal matrices 1 and 2

    When we look at R and C for each of these matrices (Figures 4.3 and 4.4),

    it is even more clear that the first matrix has been clustered better than the second


    one.

    Figure 4.3: The reordered row by reordered row matrices for 1 and 2

    Figure 4.4: The reordered column by reordered column matrices for 1 and 2

    Now this idea of entropy can be applied to clustering in the following way

    (which is nicely laid out and fully explained by Meyer in [23]). Say we have a set

    of distinct objects A = {A1, A2,...,An}, each classified with a label Lj from the set

    {L1, L2,...,Ln}, and say we group these objects into k clusters {C1, C2,...,Ck}. Then

    we can create a probability distribution containing probabilities pij such that

p_ij = (number of objects in C_i labeled L_j) / (number of objects in C_i).    (4.3)

    The entropy of an individual cluster Ci is then


H_k(C_i) = -∑_{j=1}^{k} p_ij log_k p_ij.    (4.4)

We set p_ij log_k p_ij = 0 when p_ij = 0. Therefore the entropy of the entire clustering or partition is

H = ∑_{i=1}^{k} w_i H_k(C_i),  where w_i = |C_i| / n.    (4.5)

    This entropy measure H has the following properties:

0 ≤ H ≤ 1

    H = 0 if and only if Hk(Ci) = 0 for all i = 1,...,k

    H = 1 if and only if Hk(Ci) = 1 for all i = 1,...,k

    In other words the entropy scores for a clustering will range from 0 to 1 with 0 being

    the best and 1 being the worst.
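
A minimal MATLAB sketch of this measure is shown below, assuming each object comes with an integer cluster number and an integer label and that there are at least two clusters; the thesis code in Appendix A.3 may be organized differently. Save it as clusterEntropy.m.

    function H = clusterEntropy(clusters, labels)
    % Entropy of a clustering, following Equations (4.4)-(4.5).
    % clusters(i) and labels(i) are positive integers for object i.
    k = max(clusters);
    n = numel(clusters);
    H = 0;
    for i = 1:k
        labs = labels(clusters == i);
        labs = labs(:);                              % column vector for accumarray
        p = accumarray(labs, 1) / numel(labs);       % label proportions within cluster i
        p = p(p > 0);                                % convention: 0 * log 0 = 0
        Hi = -sum(p .* log(p)) / log(k);             % base-k entropy of cluster i
        H = H + (numel(labs) / n) * Hi;              % weight by relative cluster size
    end
    end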

    What if a dataset is not labeled? In fact, none of the datasets presented in

the next chapter are labeled. In these instances there is a clever way to force labels on a dataset that has been clustered. Consider the small clustered 15 by 10 matrix

    shown below in Figure 4.5.

First we divide R according to the row clusters, which in this case will result in 3 row blocks and 3 column blocks. Then for each row of R we create a vector of ratios of the number of nonzeros in each column block over the number of columns in that block. For example, for the first row of R in Figure 4.5, the ratio vector will be {4/4, 1/7, 0/4}. If block j of row i has the highest ratio, then that row will be labeled j. Obviously, row one of R will be labeled 1. The second row is a more interesting case, as its ratio vector is {4/4, 5/7, 4/4} and so we have a tie between blocks 1 and 3. Since a row can only have one label for our entropy measure, we pick the first block with the highest ratio, and so row 2 of R will be labeled 1.
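
A rough MATLAB sketch of this labeling trick follows, assuming R is the reordered row-by-row matrix and rowClusters(i) gives the cluster of row i; the names are mine, not the thesis code.

    % Force a label on each row of R: the first block with the highest ratio of
    % nonzeros to block width wins.
    k = max(rowClusters);
    n = length(rowClusters);
    forcedLabels = zeros(n, 1);
    for i = 1:n
        ratios = zeros(1, k);
        for j = 1:k
            block = (rowClusters == j);                   % columns of R in block j
            ratios(j) = nnz(R(i, block)) / nnz(block);    % nonzero ratio for block j
        end
        [~, forcedLabels(i)] = max(ratios);               % ties go to the first block
    end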


Figure 4.5: A clustered 15 by 10 matrix A and the corresponding row by row matrix R.

    Most of the row labels for R match with the clustering, except for row 6. The

ratio vector for this row is {4/4, 3/7, 0/4}, and so this row will be labeled 1 even though it is in cluster 2. Therefore the entropy measure for R, or the row entropy measure for A,

    will be less than perfect. It turns out to be 0.1742.

    This measure, though not the method of labeling, is presented by Meyer in

    [23], and resembles the method presented in [25]. A similar cluster measure is also

    presented in [19], though here the entropy measure is relative to a perfect partition and

    so is not useful when such a partition is not known. A nice synopsis and comparison

    of several different cluster measuring techniques is presented in [16].

4.2 Comparing results from Small Yahoo! Example

    The row re-ordering of the small Yahoo! dataset for SVD Signs is perfect and has a

row entropy of zero! SVD Gaps does not do quite as well, but didn't do too badly,


with a row entropy of 0.0279. Figure 4.6 shows the reordered row matrices side by

    side.

Figure 4.6: Symmetric reordered small Yahoo! term matrix for the SVD Signs reordering with entropy 0 (left) and for SVD Gaps with entropy 0.0279.

    The column re-orderings were worse for both algorithms. The SVD Signs

    column reordering had an entropy of 0.1332, and it is obvious in Figure 4.7 that the

    column clusters of SVD Signs are not as good as its row clusters. SVD Gaps had a

    clearly worse score of 0.4493 for its column reordering.

Figure 4.7: Symmetric reordered small Yahoo! column matrix for the SVD Signs reordering with entropy 0.1332 (left) and for SVD Gaps with entropy 0.4493.


    Chapter 5

    Cluster Aggregation

    No matter how strong the theoretical or experimental evidence behind a clustering

    algorithm may be, nothing is perfect. All algorithms will have their strengths and

weaknesses, and sometimes one algorithm's weakness may be another algorithm's

    strength. Thus it makes sense to find a way to combine them and bring out the best

    from several algorithms. This is exactly what cluster aggregation does.

    Before I give my algorithm for cluster aggregation I will demonstrate it on

    a small example. Suppose three different clustering algorithms are used on a small

    dataset containing eight objects, resulting in the three clusterings shown in Figure

    5.1.

    Figure 5.1: Small example of three different clusterings


    We can use these clusterings to build a graph that represents the relationships

between the clusterings. For instance, since objects 4 and 8 are clustered together by
two of the algorithms, there is an edge with weight 2 between nodes 4 and 8 in

    the graph shown in Figure 5.2. If two objects have no common clusters, then there

    is no edge between them.

    Figure 5.2: Graph representing three different clusterings of 8 objects

    Of course, any graph can be easily translated to an adjacency matrix A, as

    in Figure 5.3, where the nodes become rows and columns and the edges become the

    entries of the matrix. For instance, the 4th row of the 8th column and the 8th row of

    the 4th column are both 2. In general, if objects i and j have n common clusters,

    then Aij = Aji = n.
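
Building this adjacency matrix is straightforward; the MATLAB sketch below assumes the clusterings are stored column-wise in an n-by-m matrix of integer labels, a layout I chose for illustration.

    % Aggregation adjacency matrix: Aagg(i,j) counts how many of the m clusterings
    % place objects i and j in the same cluster.
    [n, m] = size(clusterings);          % clusterings(i,t) = label of object i by algorithm t
    Aagg = zeros(n);
    for t = 1:m
        for i = 1:n
            for j = i+1:n
                if clusterings(i, t) == clusterings(j, t)
                    Aagg(i, j) = Aagg(i, j) + 1;
                    Aagg(j, i) = Aagg(i, j);
                end
            end
        end
    end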

    Figure 5.3: Creating an adjacency matrix from the graph


    This adjacency matrix represents all three of the clustering algorithms, and

    clustering this matrix yields a nice aggregation of the three methods. Of course, this

raises the question: what clustering method should be used? After all, if we knew the

    best clustering method, there would be no need for cluster aggregation in the first

    place. However, notice that the adjacency matrix is a square symmetric matrix. For

    these types of matrices, spectral methods, specifically those which use the signs of

the eigenvectors as in [5] and [1], have actually been proven to be the best clustering

algorithms. Therefore, I have chosen to use the method laid out in [5], and described

    in Section 2.2 of this thesis, to cluster the adjacency matrix. The results for the small

example are shown in Figure 5.4.

    Figure 5.4: Results of Cluster Aggregation with small example

    The simplicity of this aggregation method gives it a surprising amount of

    flexibility and therefore a wide range of uses. In the small example used here, all of

    the algorithms were given equal weight in the adjacency matrix, and therefore seen as

    equally valid. However, in some instances, one or more of the clusterings might stand

    out above the rest. In these cases, it is easy to adjust the values in the adjacency

matrix to reflect this disparity (a weighted variant is sketched after this paragraph). Aggregation might also be used when a proper k value

cannot be discerned for a given matrix. Instead of settling on one value of k that may

    not be optimal, several values can be chosen, and their corresponding results can be

    aggregated.
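
A rough sketch of that weighted variant is given below; the weight vector w is an assumption (it could, for example, be set inversely to each clustering's entropy score), and L is the same label matrix as before.

% Weighted aggregation sketch: w(kk) is the (assumed) weight given to
% clustering method kk; larger weights make that method count more.
[n,p] = size(L);
w = [2 1 1];                             %e.g. trust the first clustering twice as much
A = zeros(n,n);
for kk = 1:p
    same = bsxfun(@eq,L(:,kk),L(:,kk)'); %co-membership matrix for method kk
    A = A + w(kk)*(same - eye(n));       %add weighted edges, keep zero diagonal
end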


    Many aggregation techniques have been introduced in the field of data mining,

    and some are quite similar to the one presented above. The aggregation algorithm

introduced in [29] uses a hyper-graph that is equivalent to the adjacency matrix used

    here to capture the relationships between different clusterings. Another interesting

    aggregation algorithm is the one presented in [30], which is modeled after the way

    ant-colonies sort larvae and also makes use of the hyper-graphs presented by [29].

    5.1 Results on Small Yahoo! Dataset

Figure 5.5: Adjacency Matrix of Small Yahoo! Dataset using clusters from SVD Signs and SVD Gaps


    Figure 5.6: Clustered adjacency matrix and the sorted term labels


    Chapter 6

    Experiments on Large datasets

    All the time spent researching and creating these algorithms is wasted if the algo-

    rithms themselves are useless when applied to large data sets. After all, one of the

    main reasons that clustering is so important is that it can be useful for breaking down

    and processing large amounts of information. Therefore, it is important at this point

    to demonstrate the results of the two SVD based clustering algorithms when applied

    to a couple of large data sets.

    6.1 Yahoo!

The first data set is a binary 3,000 by 2,000 matrix compiled by Yahoo! The matrix

    represents the relationships between 3,000 search terms and 2,000 (anonymous) ad-

    vertisers. If an advertiser j bought a given search phrase i, then there is a 1 in the

    ijth entry of the matrix; all other entries are 0.

The image below, created with David Gleich's VISMATRIX tool, gives an idea

    of what the raw dataset looks like before reordering (the terms are in alphabetical

    order, and advertisers are in random order). Blue dots represent ones, and zeros are

    represented by whitespace.
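
For reference, a view like this can be produced with a couple of lines of MATLAB, assuming the dataset is stored in the SMAT format read by the readSMAT function in Appendix A.5; the file name below is hypothetical.

A = readSMAT('yahoo_terms_by_advertisers.smat'); %hypothetical file name
spy(A)                                           %each dot marks a 1 in the matrix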

    Before we run each of the algorithms on the full Yahoo! dataset, we must


    Figure 6.1: Vismatrix display of the raw 3000 by 2000 Yahoo! dataset.

    choose a value for k. Since the singular values can often be helpful for making this

    decision, Figure 6.2 shows a line graph of the first 100 singular values of the Yahoo!

    matrix.

    Figure 6.2: Plot of the singular values of the Yahoo! dataset.
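
A plot like Figure 6.2 can be produced with MATLAB's built-in svds (the fastsvds routine in Appendix A.5 would work as well):

s = svds(A,100);        %the 100 largest singular values of A
plot(s,'-o')
xlabel('index i'), ylabel('singular value')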

    Unfortunately, the singular values plotted in Figure 6.2 do not provide a very

    clear answer. We can see that k = 20 would probably be too early a cut-off. I finally

    chose to compare the results for k = 25 and k = 35. Figure 6.3 shows the results for


SVD Signs.

Figure 6.3: Results of the SVD Signs algorithm on the full Yahoo! dataset for k = 25 (left) and k = 35 (right).
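
The two runs compared in Figure 6.3 can be reproduced, roughly, with the SVD Signs code in Appendix A.1; here centeringtoggle = 0 (uncentered data), and trms and docs stand for the term and advertiser label arrays.

svdclusterrect(A,25,0,trms,docs); %SVD Signs with k = 25
svdclusterrect(A,35,0,trms,docs); %SVD Signs with k = 35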

    SVD Signs performed quite nicely in both of these trials. In both, there are

plenty of nice dense rectangles along the diagonal. On exploring these clusters, we

    find that the terms in each cluster have similar themes: one contains search terms

    related to hotels, one contains terms related to online gambling, etc. Which choice

    for k yields better results? For k = 25, the row entropy is 0.0555 and the column

entropy is 0.0482, whereas for k = 35 both the row and column entropies are higher.

It is clear from the pictures and entropy scores that using k = 25 results

    in better clusters; using a higher k results in too many clusters for both rows and

    columns. Next let us look at how the SVD Gaps algorithm performed. Figure 6.4

    shows the results using a tolerance level of 2.3. The SVD Gaps clustering got a row

    entropy of 0.1023 and a column entropy of 0.1421.
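
This run can be sketched with the SVD Gaps code in Appendix A.2; the tolerance 2.3 comes from the text, while using k = 25 and the same tolerance for rows and columns are assumptions made here.

svdgapcluster2(A,25,0,2.3,2.3,trms,docs); %SVD Gaps, assumed k = 25, tol = 2.3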

    How do the results compare? As we see in Figure 6.5, the SVD Signs method

produced a much cleaner block diagonal structure than SVD Gaps. However, SVD Gaps

seems to have found denser clusters than SVD Signs. This seems to indicate that


    Figure 6.4: SVD Gaps results using tol = 2.3

    SVD Gaps is good at finding really strong clusters, but not good at finding weaker

    clusters.

Figure 6.5: Side by side comparison of results from SVD Signs (left) and SVD Gaps (right).


    Figures 6.6 and 6.7 show side by side images of the reordered search phrase by

    reordered search phrase matrices, and the reordered advertiser by reordered advertiser

    matrices for SVD Signs and SVD Gaps so we can compare the row clusters and the

column clusters for the two algorithms.

    Figure 6.6: Term by Term matrices for SVD Signs (left) and SVD Gaps (right)

    Figure 6.7: Column by Column matrices for SVD Signs and SVD Gaps


Again, the results for SVD Signs look stronger at first glance. However, although

SVD Gaps did not perform as well on the dataset as a whole, it still did

a better job on the densest clusters. We might conclude that SVD Gaps

tends to focus only on the strongest clusters in a dataset and ignore the weaker

    ones. In some applications, this would be a very helpful trait.

    6.2 Wikipedia

    The next large dataset, shown in Figure 6.8, is a binary matrix representing links

between 5,176 Wikipedia articles and 4,107 categories. Each article can be placed in

    multiple categories, and each category can contain several articles.

Figure 6.8: Vismatrix display of the raw 5176 by 4107 Wikipedia dataset, and a plot of its first 100 singular values

The results of SVD Signs and SVD Gaps on the Wikipedia dataset are shown

    side by side in Figure 6.9. Neither algorithm produced a strong block diagonal struc-

    ture, but this is due to the nature of the dataset, and does not necessarily mean that


    the dataset was clustered poorly by either algorithm.

Figure 6.9: Side by side comparison of the SVD Signs and SVD Gaps results for the Wikipedia dataset.

Though the SVD Gaps reordering does not look as nice, it actually received better

(lower) entropy scores. The row entropies for SVD Signs and SVD Gaps were 0.1257 and

0.0956 respectively, while the column entropies were 0.3573 and 0.2013 respectively.

    For this dataset, it is much easier to compare the results by looking at the reordered

    article by reordered article (see Figure 6.10) and reordered category by reordered

    category matrices (see Figure 6.11).

    Looking at Figure 6.10, it is more apparent that SVD Gaps produced a better

clustering for the Wikipedia dataset. The structure is much cleaner, and there are even lots of nice clusters within clusters, which are also present for SVD Signs but

    are not as well defined.

    The column reorderings show a similar story. The reordering for SVD Signs

    looks more interesting than that of SVD Gaps, but the clusters found by SVD Signs


Figure 6.10: Reordered row by reordered row matrices for SVD Signs (left) and SVD Gaps (right).

Figure 6.11: Reordered column by reordered column matrices for SVD Signs (left) and SVD Gaps (right).

are rather weak, and we can see there are many groups of objects that lie outside

of, yet are rectilinearly aligned with, the clusters. This suggests that many columns have

been mis-clustered. SVD Gaps, on the other hand, produced only very small and


    dense clusters, and there is not much noise in the rows and columns aligned with

    these clusters.

    6.3 Netflix

    The Netflix dataset used here is a 280 user by 17,770 movie dataset containing the

users' ratings (from 1 through 5, with 5 being the best) of movies rented through Netflix.

    All the users in this dataset rated at least 500 movies. An image of this matrix is

    shown below in Figure 6.12.

Figure 6.12: Vismatrix display of the raw 280 by 17,770 Netflix dataset and a plot of its singular values.

    The results for both methods, shown in Figure 6.13, are interesting. Both

    methods seem to have gathered as much data as possible into a few clusters in the

    outside columns, and neither algorithm found very many column clusters.

    This seems odd, since the number of columns is so large, but makes more

    sense when the odd nature of the dataset is considered. Remember that the dataset

only represents users who have rated at least 500 movies. Thus, every person

    represented in the matrix is a Netflix superuser who must like movies a great deal,


Figure 6.13: Results for SVD Signs with k = 7 and for SVD Gaps with k = 7 and tol = 2

    and many of these 280 people probably share a lot of opinions. If all of the users are

rating similar numbers of similar movies with similar scores, the dataset will be fairly

homogeneous, which will lead to strange and poor clusterings. After considering these

    aspects of the dataset, it makes more sense that both methods found a few very large

    and dense clusters.

    Unfortunately, because the matrix is so large, there was not enough memory in

MATLAB to compute the entropy scores for the SVD Signs and SVD Gaps clusterings

of the Netflix dataset, and so we cannot conclude whether one algorithm performed

    better than another on this dataset.


    Chapter 7

    Conclusion

Thus far, SVD Signs tends to be the more robust clustering algorithm. Because it only

has one parameter, k, it is easier to work with and produces better overall clusterings

on a more regular basis than SVD Gaps. With the SVD Gaps method introduced here,

    we not only have to choose the appropriate value for k, we also have to choose an

    appropriate tolerance level for the dataset.

    However, there are some very positive aspects of the SVD Gaps method. The

    method showed again and again that it excelled in singling out the strongest clusters in

the dataset, while ignoring weaker and less important clusters. In many applications

    of clustering, this might be a highly desirable quality. Regardless, SVD Gaps has

    not yet reached its full potential. It might be helpful to first learn more about

    the statistical distribution of a dataset before applying the SVD Gaps algorithm.

    This would help us to determine a tolerance level, and maybe even whether different

    tolerance levels should be used for different singular vectors.

    The Cluster Aggregation algorithm presented in Chapter 5 has a lot of promise

    as well, and it would have been nice to spend more time working and experimenting

    with it. Some expansions on this algorithm also need to be explored. Weighting

clusterings before aggregating them (perhaps inversely proportional to


    their entropy scores) could improve results. Also, the algorithm could be used to

    aggregate clusterings for different values of k, especially in cases where one value

    produces too few clusters, but a higher value produces too many.

    Some other areas that need further work are the entropy measure, the method

    of choosing k, and cluster ordering. The entropy measure works nicely, but the

labeling scheme needs to be expanded to work on denser matrices, as it now

works only for sparse matrices. Since the success of either SVD Signs or SVD Gaps

    depends so much on the choice of k, a more reliable method must be found and

    employed for both algorithms. Finally, it would be nice if clusters were ordered and

    displayed so that clusters next to each other are most similar; this would be most

    useful when analyzing a clustering for data mining purposes.


    Appendix A

    MATLAB Code

    A.1 SVD Signs

    function []=svdclusterrect(A,k,centeringtoggle,trms,docs);

    %% INPUT: A = m by n matrix

    %% k = number of principal directions to compute

    %% centeringtoggle = 1 if you want to center the data matrix A

%% first and work on the centered matrix C

    %% = 0 if you want to work with uncentered data

    %% matrix A

    %% trms = row labels

    %% docs = column labels

[m,n]=size(A); %finds the dimensions of A

if centeringtoggle==1

    mu = A*ones(n,1)/n;


    A= A-mu*ones(1,n); % A is now the centered A matrix

    end

    [U,S,V]=fastsvds(A,k); %finds truncated SVD of A

    % to find a row reordering that clusters rows

    E=(U>=0);

    %finds positive entries of left singular vectors

    x=zeros(m,1);

    %creates vector of zeros

    for i=1:k;

    x=x+(2^(i-1))*(E(:,k-i+1));

    end

    %designates sign pattern to each row of U

    [sortedrowx,rowindex]=sort(x);

    %sorts x by sign patterns of U

    numrowclusters=length(unique(x))

    %finds the number of row clusters of A

    % to find a column reordering that clusters columns

    F=(V>=0);


    % finds all positive entries of matrix V

    y=zeros(n,1); %creates vector of zeros

    for i=1:k;

    y=y+(2^(i-1))*(F(:,k-i+1));

    end

    %designates signs patterns to each row of V

    [sortedcoly,colindex]=sort(y);

    %sorts y by sign patterns of V

    numcolclusters=length(unique(y))

    %finds number of column clusters of A

    if centeringtoggle==1

    A= A+mu*ones(1,n);

    % A changed back to the original uncentered A matrix

    end

    reorderedA=A(rowindex,colindex);

    %reorders A into clustered form

    spy(reorderedA)

    %spyplot of reordered matrix

    trms=cellstr(trms);


    reorderedterms=trms(rowindex);

    %reorders row labels

    docs=cellstr(docs);

    reordereddocs=docs(colindex);

    %reorders column labels

er=entropy2(reorderedA*reorderedA',x,m,numrowclusters)

    %finds row entropy of reordered A

ec=entropy2(reorderedA'*reorderedA,y,n,numcolclusters)

    %finds column entropy of reordered A

    cd /Users/langvillea/David/vismatrix2

    vismatrix2(reorderedA,reorderedterms,reordereddocs)

    cd /Users/langvillea/Desktop/datavisSAS-Meyer/vismatrix

    %sends reordered A to VISMATRIX tool

    A.2 SVD Gaps

    function []=svdgapcluster2(A,k,centeringtoggle,confidence,

    confidence2,termlabels,doclabels)

    %% INPUT: A = m by n matrix

    %% k = number of principal directions to compute

    %% centeringtoggle = 1 if you want to center the data matrix A

    %% first and work on the centered matrix C


    %% = 0 if you want to work with uncentered data

    %% matrix A

    %% confidence = tolerance level for rows

%% confidence2 = tolerance level for columns

    %% termlabels = labels for rows

    %% doclabels = labels for columns

    [m,n]=size(A);

    % finds the dimensions of A

    if centeringtoggle==1

    mu = A*ones(n,1)/n;

    A= A-mu*ones(1,n); % A is now the centered A matrix

    end

    [U,S,V]=fastsvds(A,k);

    % finds truncated SVD of A

    % later do smart implementation of svd on centered data

%using rank-one update rules.

    %% for Term Clustering

    [sortedU,index]=sort(U);

    %sort left singular vectors

    gapmatrix=sortedU(2:m,:)-sortedU(1:m-1,:);


    %each column of gapmatrix contains gaps of a left singular vector

    gapmeans=mean(gapmatrix,1);

    %find mean gap size for each vector

    stdgaps=std(gapmatrix,1,1);

    %find st. dev of gap size for each vector

    gapszscore=(gapmatrix-ones(m-1,1)*gapmeans)./(ones(m-1,1)*stdgaps);

    %convert all gaps to z-scores

    [row,col]=find(gapszscore>confidence);

    %find indices for all z-scores that are greater than

    %tolerance level for rows

    D=full(sparse(row,col,ones(length(row),1),m,k));

    %creates binary sparse matrix whose columns

    %contain ones to mark where large gaps in sing vectors occur

    C=zeros(m,k);

    for j=1:k

    count=0;

    for i=1:m

    C(i,j)=count+1; %creates cluster label matrix

    if D(i,j)==1

    count=count+1;

    %cluster label changes where large gaps occur

    end

    end

    end


    % matrix C is the matrix of cluster labels

    % need to sort C by index matrix

    [sortedindex,IIndex]=sort(index,1);

    for i=1:k

    C(:,i)=C((IIndex(:,i)),i);

    end

    %each column of cluster labels is now sorted

[b,i,h]=unique(C,'rows');

    %finds the rows of C with the same label patterns

    %these rows will be clustered together

    [termclusters,termclusterindex]=sort(h);

    %finds row reordering for A

    C(termclusterindex,:);

    %reorders the cluster label matrix

    %rows with same cluster label patterns will be adjacent

    numtrmclusters=size(b)

    %finds number of row clusters of A

    %% For Doc Clustering

    [sortedV,indexV]=sort(V);

    %sort right singular vectors

    gapmatrix=sortedV(2:n,:)-sortedV(1:n-1,:);

    %each column of gapmatrix contains gaps of a right singular vector

    gapmeans=mean(gapmatrix,1);


    %find mean gap size for each vector

    stdgaps=std(gapmatrix,1,1);

    %find st. dev of gap size for each vector

    gapszscore=(gapmatrix-ones(n-1,1)*gapmeans)./(ones(n-1,1)*stdgaps);

    %convert all gaps to z-scores

    [row,col]=find(gapszscore>confidence2);

    %find indices for all z-scores that are

    %greater than tolerance level for columns

    F=full(sparse(row,col,ones(length(row),1),n,k));

    %creates binary sparse matrix whose columns

    %contain ones to mark where large gaps in sing vectors occur

    E=zeros(n,k);

    for j=1:k

    count=0;

    for i=1:n

    E(i,j)=count+1; %creates cluster label matrix

    if F(i,j)==1

    count=count+1;

    %cluster label changes where large gaps occur

    end

    end

    end

    % matrix E is the matrix of cluster labels

    % need to sort E by index matrix


    [sortedindexV,IIndexV]=sort(indexV,1);

    for i=1:k

    E(:,i)=E((IIndexV(:,i)),i);

    end

    %each column of cluster labels is now sorted

[b,i,z]=unique(E,'rows');

%finds the rows of E with the same label patterns

    %these rows represent columns of A that will be clustered together

    [docclusters,docclusterindex]=sort(z);

    %finds row reordering for A

    E(docclusterindex,:);

    %reorders the cluster label matrix

    %rows with same cluster label patterns will be adjacent

    numdocclusters=size(b)

%finds number of column clusters of A

    if centeringtoggle==1

    A= A+mu*ones(1,n);

    end

    % A is now back to the original uncentered A matrix

    reorderedA=A(termclusterindex,docclusterindex);

    %reorders A into clustered form

    spy(reorderedA)

    %spyplot of reordered A


    doclabels=cellstr(doclabels);

    doclabels=doclabels(docclusterindex);

    %reorders column labels of A

    termlabels=cellstr(termlabels);

    termlabels=termlabels(termclusterindex);

    %reorders row labels of A

er=entropy2(reorderedA*reorderedA',h,m,numtrmclusters(1))

    %finds row entropy of reordered A

ec=entropy2(reorderedA'*reorderedA,z,n,numdocclusters(1))

    %finds column entropy of reordered A

    cd /Users/langvillea/David/vismatrix2

vismatrix2(reorderedA, termlabels, doclabels)

    cd /Users/langvillea/Desktop/datavisSAS-Meyer/vismatrix

    %sends reordered A to VISMATRIX tool

    A.3 Entropy Measure

    function [entropy]=entropy2(A,clusters,n,k)

    %input A is symmetric row by row or column by column
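
%clusters is the vector of cluster labels for the rows (or columns) of A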

    %n is the number of rows of A

    %k is the number of clusters of A

x=sort(clusters); %cluster labels in ascending order


[m,xstart]=unique(x,'first'); %find cluster start row

[m,xstop]=unique(x,'last'); %find cluster stop row

    spy(A)

    Q=zeros(n,k);

    for i=1:n

    for j=1:k

    maxvector(j)=nnz(A(i,xstart(j):xstop(j)))/((xstop(j)+1)-xstart(j));

    end

    [maxratio,index]=max(maxvector);

    Q(i,index(1))=1;

    end

    %forces row labels using max ratio vector

    P=zeros(k,k);

    for j=1:k

    for i=1:k

    P(j,i)=sum(Q(xstart(j):xstop(j),i))/((xstop(j)+1)-xstart(j));

    end

    end

    %creates a matrix of p(i,j)s

    P=(P.*log(P))/log(k);

    [row,col]=find(isnan(P));

    for i=1:length(row)


    P(row(i),col(i))=0;

    end

    %if p(i,j)=0 then set p(i,j)*log p(i,j)=0

    for i=1:k

    H(i)=-sum(P(i,:));

    alpha(i)=((xstop(i)+1)-xstart(i))/n;

    end

    %finds entropy measure for each cluster

    entropy=sum(alpha.*H)

    %finds entropy of entire partition

    A.4 Cluster Aggregation

    %%%%%%%%%%%%%%%%% ClusterAgg.m %%%%%%%%%%%%%%%%

    %% INPUT : L = n-by-p matrix of cluster results;

    %% column i contains results from clustering method i

    %% OUTPUT : A = n-by-n weighted undirected (symmetric)

    %% Aggregation matrix;

    %% A(i,j) = # of methods having items i and j in same cluster

    %% EXAMPLE INPUT: L=[1 3 1;3 1 2; 2 2 2 ; 1 3 1; 1 1 3; 2 2 2]

    % % L =

    %%

    %% 1 3 1


    %% 3 1 2

    %% 2 2 2

    %% 1 3 1

    %% 1 1 3

    %% 2 2 2

    %% means that clustering method 1 (info. in col. 1 of L) groups items

    %% 1, 4 and 5 together, then

%% items 3 and 6, and finally item 2 in its own cluster, creating

    %% a total of three clusters. Clustering method 2

    %% (info. in col. 2 of L)

    %% groups items 1 and 4, then 3 and 6, and finally 2 and 5. Notice

    %% that the cluster assignment labels used by one clustering

    %% method do not need to match those from another method. And

    %% the number of clusters found by each method do not need to

    %% match either. Yet all lists must be full, i.e., have the same

    %% number of items.

    function [] = ClusterAgg(L,labels);

    % n = # of items/documents

    % p = # of lists of cluster assignment results = # of methods

    [n,p]=size(L);

    A=zeros(n,n);

    % need to do p*(n choose 2) pairwise comparisons to create

    %Aggregation matrix A.

    for i=1:n


    for j=i+1:n

    matchcount=0;

    for k=1:p

    if L(i,k)==L(j,k)

    matchcount=matchcount+1;

    end

    end

    A(i,j)=matchcount;

    end

    end

A=A+A'; %makes the aggregation matrix symmetric

    %Now run any clustering method you like on Aggregation matrix A

    %For example, code for running the extended Fiedler method is below.

    %The Fiedler method is a good choice since the graph is undirected.

    % F = Fiedler matrix F=D-A

    D=diag(A*ones(n,1));

    F=D-A;

    % k = # of eigenvectors to use for extended Fiedler

    k=2;

[FiedlerVector,evalue]=eigs(F,k+1,'sa');

    FiedlerVector=FiedlerVector(:,2:k+1)

    U=(FiedlerVector>=0);


    x=zeros(n,1);

    for l=1:k;

    x=x+(2^(k-l))*(U(:,l));

    end

    % x contains the cluster assignments

    x

    % numclusters is the number of clusters produced by the aggregated

    %method

    numclusters=length(unique(x))

    %%%%%%%%%%%%%%%%%% ClusterAgg.m %%%%%%%%%%%%

    A.5 Other Code Used

    %%%%%%%%%%%BEGIN GLEICHS fastsvds.m%%%%%%%%%%%

    function [U, S, V] = fastsvds(varargin)

    % fastsvds performs the singular value decomposition more quickly than

% MATLAB's built-in function.

    %

    % [U S V] = fastsvds(A, k, sigma, opts)

    %

    % see svds for a description of the parameters.

    %


    A = varargin{1};

    [m,n] = size(A);

    p = min(m,n);

    if nargin < 2

    k = min(p,6);

    else

    k = varargin{2};

    end

    if nargin < 3

    bk = min(p,k);

    if isreal(A)

bsigma = 'LA';

    else

bsigma = 'LR';

    end

    else

    sigma = varargin{3};

    if sigma == 0 % compute a few extra eigenvalues to be safe

    bk = 2 * min(p,k);

    else

    bk = k;

    end

if strcmp(sigma,'L')

    if isreal(A)

bsigma = 'LA';


    else

bsigma = 'LR';

    end

elseif isa(sigma,'double')

    bsigma = sigma;

    if ~isreal(bsigma)

error('Sigma must be real');

    end

    else

error('Third argument must be a scalar or the string ''L''')

    end

    end

    if isreal(A)

    boptions.isreal = 1;

    else

    boptions.isreal = 0;

    end

    boptions.issym = 1;

if nargin < 4

% norm(B*W-W*D,1)/norm(B,1) <= tol/sqrt(2)  =>  norm(A*V-U*S,1)/norm(A,1) <= tol
boptions.tol = 1e-10 / sqrt(2);
boptions.disp = 0;

else


    options = varargin{4};

    if isstruct(options)

if isfield(options,'tol')

    boptions.tol = options.tol / sqrt(2);

    else

    boptions.tol = 1e-10 / sqrt(2);

    end

if isfield(options,'maxit')

    boptions.maxit = options.maxit;

    end

if isfield(options,'disp')

    boptions.disp = options.disp;

    else

    boptions.disp = 0;

    end

    else

error('Fourth argument must be a structure of options.')

    end

    end

    if (m > n)

    % this means we want to find the right singular vectors first

% [V D] = eigs(A'*A)

%f = inline('global AFASTSVDMATRIX;

%AFASTSVDMATRIX''*(AFASTSVDMATRIX*v)', 'v');

    [V D] = eigs(@multiply_mtm, n, bk, bsigma, boptions, A);

    [dummy, perm] = sort(-diag(D));


    S = diag(sqrt(diag(D(perm, perm))));

    V = V(:, perm);

    Sinv = diag(1./sqrt(diag(D)));

    U = (A*V)*Sinv;

    else

    % find the left singular vectors first

% [U D] = eigs(A*A')

%f = inline('global AFASTSVDMATRIX; A*(A''*v)', 'v');

    [U D] = eigs(@multiply_mmt, m, bk, bsigma, boptions, A);

    [dummy, perm] = sort(-diag(D));

    S = diag(sqrt(diag(D(perm, perm))));

    U = U(:, perm);

    Sinv = diag(1./sqrt(diag(D)));

V = Sinv*(U'*A);

V = V';

    end;

if nargout <= 1
U = diag(S); %when only one output is requested, return the singular values
end

function mtmv = multiply_mtm(v, A)
mtmv = A'*(A*v); %multiplies a vector by A'*A (used above when m > n)


    function mmtv = multiply_mmt(v, A)

mmtv = A*(A'*v);

    %global AFASTSVDMATRIX;

%mmtv = AFASTSVDMATRIX*(AFASTSVDMATRIX'*v);

    %%%%%%%%%%%%%END GLEICHS fastsvds.m%%%%%%%%%%

    %%%%%%%%%%readSMAT.m%%%%%%%%%%%%

    function A = readSMAT(filename)

    % readSMAT reads an indexed sparse matrix representation of

    % a matrix and creates a MATLAB sparse matrix.

    %

    % A = readSMAT(filename)

    % filename - the name of the SMAT file

    % A - the MATLAB sparse matrix

    %

    s = load(filename);

    m = s(1,1);

    n = s(1,2);

    ind_i = s(2:length(s),1)+1;

    ind_j = s(2:length(s),2)+1;

    val = s(2:length(s),3);

    A = sparse(ind_i,ind_j,val, m, n);

    %%%%%%%%%%end readSMAT.m%%%%%%%%%%


    Bibliography

[1] Charles J. Alpert, Andrew B. Kahng, and So-Zen Yao. Spectral partitioning: The

more eigenvectors, the better. In Proc. ACM/IEEE Design Automation Conf.,

pages 195-200, 1995.

    [2] Reid Andersen, Fan Chung, and Kevin Lang. Local graph partitioning using

    PageRank vectors. In Proceedings of the 47th Annual IEEE Symposium on Foun-

    dations of Computer Science, 2006.

[3] Francis R. Bach and Michael I. Jordan. Finding clusters in independent compo-

nent analysis. In 4th Intl. Symp. on Independent Component Analysis and

Signal Separation (ICA 2003), pages 891-896, 2003.

[4] Barbara E. Ball. Clustering directed graphs without symmetrization. Master's

Thesis, College of Charleston, 2006.

[5] Ibai E. Basabe. A new way to cluster data. Master's Thesis, College of Charleston,

2007.

[6] Michael W. Berry, Susan T. Dumais, and Gavin W. O'Brien. Using linear algebra

for intelligent information retrieval. SIAM Review, 37:573-595, 1995.


    [7] Michael W. Berry and Murray Browne. Understanding search engines: mathe-

    matical modeling and text retrieval. Society for Industrial and Applied Mathe-

    matics, Philadelphia, PA, USA, 1999.

    [8] Mike W. Berry and Murray Browne. Lecture Notes in Data Mining. World

    Scientific Publishing Co., 2006.

[9] Daniel Boley. Principal direction divisive partitioning. Data Mining and Knowl-

edge Discovery, 2:325-344, 1998.

    [10] Fan R.K. Chung. Spectral Graph Theory. Number 92. American Mathematical

    Society, 1997.

[11] David Gleich, Leonid Zhukov, and Matt Rasmussen. Vismatrix, 2006.

    [12] Brian S. Everitt, Sabine Landau, and Morven Leese. Cluster Analysis. Arnold

    Publishers, May 2001.

    [13] Miroslav Fiedler. Algebraic connectivity of graphs. Czech