Other Clustering Algorithms


    MST: Divisive Hierarchical Clustering

- Build the MST (Minimum Spanning Tree):
  - Start with a tree that consists of any single point.
  - In successive steps, look for the closest pair of points (p, q) such that one point (p) is in the current tree but the other (q) is not.
  - Add q to the tree and put an edge between p and q.


    MST: Divisive Hierarchical Clustering

- Use the MST to construct a hierarchy of clusters: repeatedly remove the largest remaining MST edge to split one cluster into two.
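A minimal sketch of this MST-based divisive scheme, assuming NumPy and SciPy; the toy 2-D points and the choice of three clusters are purely illustrative.

# Sketch: build the minimum spanning tree of the full distance graph, then
# cut the largest MST edges so the remaining forest has n_clusters
# connected components (the clusters).
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_divisive(points, n_clusters):
    dist = squareform(pdist(points))                       # pairwise distances
    mst = minimum_spanning_tree(csr_matrix(dist)).toarray()
    edges = np.argwhere(mst > 0)
    weights = mst[edges[:, 0], edges[:, 1]]
    for i, j in edges[np.argsort(weights)[::-1][:n_clusters - 1]]:
        mst[i, j] = 0                                      # cut the longest edges
    _, labels = connected_components(csr_matrix(mst), directed=False)
    return labels

points = np.random.RandomState(0).rand(30, 2)              # toy data
print(mst_divisive(points, n_clusters=3))

Cutting the k - 1 longest MST edges this way yields the same partition as single-link hierarchical clustering cut at k clusters.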


    DBSCAN

- DBSCAN is a density-based algorithm. Density = number of points within a specified radius (Eps).
- A point is a core point if it has more than a specified number of points (MinPts) within Eps. These are points in the interior of a cluster.
- A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point.
- A noise point is any point that is neither a core point nor a border point.
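A minimal sketch of these three definitions, assuming Euclidean distance and scikit-learn; eps and min_pts are illustrative values, and the neighborhood count here includes the point itself.

# Sketch: label each point as core, border, or noise from the Eps / MinPts
# definitions above.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def point_types(X, eps, min_pts):
    nn = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nn.radius_neighbors(X, return_distance=False)
    is_core = np.array([len(nbrs) >= min_pts for nbrs in neighborhoods])
    labels = np.full(len(X), "noise", dtype=object)
    labels[is_core] = "core"
    for i, nbrs in enumerate(neighborhoods):
        # border point: not core, but a core point lies in its Eps-neighborhood
        if not is_core[i] and is_core[nbrs].any():
            labels[i] = "border"
    return labels

X = np.random.RandomState(1).rand(50, 2)
print(point_types(X, eps=0.15, min_pts=4))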


    DBSCAN: Core, Border, and Noise Points


    DBSCAN Algorithm

- Eliminate noise points.
- Perform clustering on the remaining points.
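For a concrete run, scikit-learn ships a DBSCAN implementation; a minimal usage sketch on synthetic data, with illustrative parameters.

# Sketch: running DBSCAN. Label -1 marks noise points; core points are
# listed in core_sample_indices_.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=4).fit(X)

print("cluster labels:", np.unique(db.labels_))
print("core points:", len(db.core_sample_indices_))
print("noise points:", int(np.sum(db.labels_ == -1)))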


    DBSCAN: Core, Border and Noise Points

(Figure: original points and point types (core, border, noise), with Eps = 10, MinPts = 4.)


    When DBSCAN Works Well

(Figure: original points vs. the clusters found by DBSCAN.)

- Resistant to noise
- Can handle clusters of different shapes and sizes


    When DBSCAN Does NOT Work Well

(Figure: original points and DBSCAN results with MinPts = 4, Eps = 9.75 and with MinPts = 4, Eps = 9.92.)

DBSCAN has trouble with:
- Varying densities
- High-dimensional data


    DBSCAN: Determining EPS and MinPts

- The idea is that for points in a cluster, their k-th nearest neighbors are at roughly the same distance.
- Noise points have their k-th nearest neighbor at a farther distance.
- So, plot the sorted distance of every point to its k-th nearest neighbor; the sharp bend ("knee") in the curve suggests a value for Eps, with MinPts = k.
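A minimal sketch of this k-distance plot, assuming scikit-learn and matplotlib, with k = MinPts = 4 and synthetic data for illustration.

# Sketch: sort each point's distance to its k-th nearest neighbor and plot it;
# Eps is read off at the sharp bend ("knee") of the curve.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
k = 4                                   # use k = MinPts

# ask for k + 1 neighbors because each point is returned as its own neighbor
dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
kth_dist = np.sort(dists[:, k])

plt.plot(kth_dist)
plt.xlabel("points sorted by distance to their k-th nearest neighbor")
plt.ylabel("k-th nearest neighbor distance")
plt.show()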


    Cluster Validity

- For supervised classification we have a variety of measures to evaluate how good our model is: accuracy, precision, recall.
- For cluster analysis, the analogous question is: how do we evaluate the "goodness" of the resulting clusters?
- But clusters are in the eye of the beholder!
- Then why do we want to evaluate them?
  - To avoid finding patterns in noise
  - To compare clustering algorithms
  - To compare two sets of clusters
  - To compare two clusters


    Graph Theoretic Interpretation

As we merge the nearest pair of clusters at each step, will the merge distance go up or down?

- Let d_n be the distance between the clusters merged at step n (counting from the bottom).
- Let G(d_n) be a graph that has edges between all data points within distance d_n of each other.
- Then the clusters after step n are the connected components of G(d_n). Why?
- This explains the potentially long, straggly clusters that single link (MIN) produces.
- Assumptions?
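A minimal sketch of this interpretation for single link, assuming SciPy: threshold the distance matrix at the merge distance d_n and compare the connected components of that graph with the flat single-link clusters at the same height. The data and the chosen step are illustrative.

# Sketch: single-link clusters at merge distance d_n equal the connected
# components of the graph G(d_n) with an edge between every pair within d_n.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.RandomState(0).rand(40, 2)
D = squareform(pdist(X))

Z = linkage(pdist(X), method="single")
d_n = Z[30, 2]                                   # merge distance at step 31

adj = csr_matrix((D <= d_n) & (D > 0))           # G(d_n)
_, components = connected_components(adj, directed=False)
flat = fcluster(Z, t=d_n, criterion="distance")  # single-link clusters at d_n

# the two partitions agree up to a renaming of the labels
print(len(np.unique(components)), len(np.unique(flat)))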


    How to Define Inter-Cluster Similarity

- MIN
- MAX
- Group Average
- Distance Between Centroids
- Other methods driven by an objective function
  - Ward's Method uses squared error

Paraphrasing The President: "You're either part of my cluster or you're not."


    Graph Theoretic Interpretation

Assume each pair of points has a unique distance. Let d_n be the diameter of the cluster formed at step n.

- Let G(d_n) be a graph that has edges between all data points within distance d_n of one another.
- Then the clusters after step n are *cliques* of G(d_n). Why?
- How can we make use of these graph-theoretic interpretations?




    Graph-Based Clustering

- Graph-based clustering uses the proximity graph:
  - Start with the proximity matrix.
  - Consider each point as a node in a graph.
  - Each edge between two nodes has a weight which is the proximity between the two points.
  - Initially the proximity graph is fully connected.
  - MIN (single-link) and MAX (complete-link) can be viewed as starting with this graph.
- In the simplest case, clusters are connected components in the graph.


    Graph-Based Clustering: Sparsification

- Clustering may work better after sparsification:
  - Sparsification techniques keep the connections to the most similar (nearest) neighbors of a point while breaking the connections to less similar points.
  - The nearest neighbors of a point tend to belong to the same class as the point itself.
  - This reduces the impact of noise and outliers and sharpens the distinction between clusters.
- Sparsification facilitates the use of graph partitioning algorithms (or algorithms based on graph partitioning), e.g., Chameleon and hypergraph-based clustering.
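A minimal sketch of nearest-neighbor sparsification with NumPy: for each point, keep only the links to its k most similar neighbors. k and the toy similarity matrix are illustrative, and the result is directed (asymmetric) unless you symmetrize it.

# Sketch: zero out every entry of a dense similarity matrix except, per row,
# the k largest (the point's k most similar neighbors).
import numpy as np

def sparsify_knn(sim, k):
    sim = sim.copy()
    np.fill_diagonal(sim, -np.inf)               # never keep self-links
    keep = np.argsort(sim, axis=1)[:, -k:]       # k most similar per row
    out = np.zeros_like(sim)
    rows = np.arange(sim.shape[0])[:, None]
    out[rows, keep] = sim[rows, keep]
    return out

X = np.random.RandomState(0).rand(10, 3)
similarity = X @ X.T                              # toy similarity matrix
print(sparsify_knn(similarity, k=3))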


    Sparsification in the Clustering Process


Chameleon: Clustering Using Dynamic Modeling

- Adapt to the characteristics of the data set to find the natural clusters.
- Use a dynamic model to measure the similarity between clusters:
  - The main properties are the relative closeness and relative interconnectivity of the clusters.
  - Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters.
  - The merging scheme preserves self-similarity.
- One of the areas of application is spatial data.


    Characteristics of Spatial Data Sets

- Clusters are defined as densely populated regions of the space.
- Clusters have arbitrary shapes, orientations, and non-uniform sizes.
- Densities differ across clusters, and density varies within clusters.
- There are special artifacts (streaks) and noise.

The clustering algorithm must address the above characteristics while requiring minimal supervision.


    Chameleon: Steps

- Preprocessing step: represent the data by a graph.
  - Given a set of points, construct the k-nearest-neighbor (k-NN) graph to capture the relationship between a point and its k nearest neighbors.
  - The concept of neighborhood is captured dynamically (even if a region is sparse).
- Phase 1: use a multilevel graph partitioning algorithm on the graph to find a large number of clusters of well-connected vertices.
  - Each cluster should contain mostly points from one true cluster, i.e., be a sub-cluster of a real cluster.


    Chameleon: Steps

- Phase 2: use hierarchical agglomerative clustering to merge the sub-clusters.
  - Two clusters are combined if the resulting cluster shares certain properties with the constituent clusters.
  - Two key properties are used to model cluster similarity (trying to capture how similar two sub-graphs are):
    - Relative Interconnectivity: absolute interconnectivity of two clusters normalized by the internal connectivity of the clusters.
    - Relative Closeness: absolute closeness of two clusters normalized by the internal closeness of the clusters.
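In symbols (my rendering of the usual Chameleon definitions, so treat the exact notation as an assumption rather than as a quote from the slides): with EC_{Ci,Cj} the edges connecting the two clusters, EC_{Ci} the edges cut by a min-cut bisection of Ci, and S-bar denoting average edge weight,

% Relative interconnectivity: cut between the clusters over the average
% internal (bisection) cut of the two clusters.
\[ RI(C_i, C_j) = \frac{\lvert EC_{\{C_i, C_j\}} \rvert}{\tfrac{1}{2}\,(\lvert EC_{C_i} \rvert + \lvert EC_{C_j} \rvert)} \]

% Relative closeness: average weight of the connecting edges over a
% size-weighted average of the internal (bisection) edge weights.
\[ RC(C_i, C_j) = \frac{\bar{S}_{EC_{\{C_i, C_j\}}}}{\frac{\lvert C_i \rvert}{\lvert C_i \rvert + \lvert C_j \rvert}\,\bar{S}_{EC_{C_i}} + \frac{\lvert C_j \rvert}{\lvert C_i \rvert + \lvert C_j \rvert}\,\bar{S}_{EC_{C_j}}} \]

Chameleon merges the pair of sub-clusters whose combination of RI and RC is high.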


    Experimental Results: CHAMELEON


    Experimental Results: CHAMELEON


Experimental Results: CURE (10 clusters)


Experimental Results: CURE (15 clusters)


    Experimental Results: CHAMELEON


Experimental Results: CURE (9 clusters)


Experimental Results: CURE (15 clusters)


Shared Near Neighbor Approach

(Figure: two points, i and j, and the neighbors they share.)

SNN graph: the weight of an edge is the number of shared neighbors between the vertices, given that the vertices are connected.


    Creating the SNN Graph

Sparse graph: link weights are similarities between neighboring points.

Shared near neighbor graph: link weights are the number of shared nearest neighbors.
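A minimal sketch of constructing the SNN graph, assuming NumPy and scikit-learn; k is an illustrative parameter, and only mutual k-nearest-neighbor pairs (the "connected" vertices) receive an edge.

# Sketch: SNN graph. Edge weight (i, j) = number of shared k-nearest
# neighbors, counted only when i and j are in each other's k-NN lists.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def snn_graph(X, k):
    # k + 1 because each point comes back as its own nearest neighbor
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    neighbors = [set(row[1:]) for row in idx]              # drop self
    n = len(X)
    W = np.zeros((n, n), dtype=int)
    for i in range(n):
        for j in neighbors[i]:
            if i in neighbors[j]:                          # mutual k-NN link
                W[i, j] = W[j, i] = len(neighbors[i] & neighbors[j])
    return W

X = np.random.RandomState(0).rand(20, 2)
print(snn_graph(X, k=5))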


ROCK (RObust Clustering using linKs)

- Clustering algorithm for data with categorical and Boolean attributes.
  - A pair of points is defined to be neighbors if their similarity is greater than some threshold.
  - Uses a hierarchical clustering scheme to cluster the data:
    1. Obtain a sample of points from the data set.
    2. Compute the "link" value for each set of points, i.e., transform the original similarities (computed by the Jaccard coefficient) into similarities that reflect the number of shared neighbors between points.
    3. Perform an agglomerative hierarchical clustering on the data, using the number of shared neighbors as the similarity measure and maximizing the shared-neighbors objective function.
    4. Assign the remaining points to the clusters that have been found.


    Jarvis-Patrick Clustering

- First, the k nearest neighbors of all points are found.
  - In graph terms this can be regarded as breaking all but the k strongest links from a point to other points in the proximity graph.
- A pair of points is put in the same cluster if:
  - the two points share more than T neighbors, and
  - the two points are in each other's k-nearest-neighbor lists.
- For instance, we might choose a nearest-neighbor list of size 20 and put points in the same cluster if they share more than 10 near neighbors.
- Jarvis-Patrick clustering is too brittle.
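A minimal sketch of Jarvis-Patrick clustering built from the steps above, assuming scikit-learn and SciPy; k = 20 and T = 10 follow the example, and the clusters are read off as connected components of the resulting graph.

# Sketch: Jarvis-Patrick. Link two points if they are in each other's k-NN
# lists and share more than T of those neighbors; clusters are the connected
# components of the link graph.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def jarvis_patrick(X, k=20, T=10):
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    neighbors = [set(row[1:]) for row in idx]              # drop self
    n = len(X)
    adj = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in neighbors[i]:
            if i in neighbors[j] and len(neighbors[i] & neighbors[j]) > T:
                adj[i, j] = adj[j, i] = True
    _, labels = connected_components(csr_matrix(adj), directed=False)
    return labels

X = np.random.RandomState(0).rand(100, 2)
print(jarvis_patrick(X, k=20, T=10))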


When Jarvis-Patrick Works Reasonably Well

(Figure: original points vs. Jarvis-Patrick clustering with 6 shared neighbors out of 20.)


When Jarvis-Patrick Does NOT Work Well

(Figure: results with the smallest threshold T that does not merge the clusters, and with a threshold of T - 1.)


    SNN Clustering Algorithm

1. Compute the similarity matrix.
   This corresponds to a similarity graph with data points for nodes and edges whose weights are the similarities between data points.
2. Sparsify the similarity matrix by keeping only the k most similar neighbors.
   This corresponds to keeping only the k strongest links of the similarity graph.
3. Construct the shared nearest neighbor graph from the sparsified similarity matrix.
   At this point, we could apply a similarity threshold and find the connected components to obtain the clusters (the Jarvis-Patrick algorithm).
4. Find the SNN density of each point.
   Using a user-specified parameter, Eps, find the number of points that have an SNN similarity of Eps or greater to each point. This is the SNN density of the point.


    SNN Clustering Algorithm

5. Find the core points.
   Using a user-specified parameter, MinPts, find the core points, i.e., all points that have an SNN density greater than MinPts.
6. Form clusters from the core points.
   If two core points are within a radius, Eps, of each other, they are placed in the same cluster.
7. Discard all noise points.
   All non-core points that are not within a radius of Eps of a core point are discarded.
8. Assign all non-noise, non-core points to clusters.
   This can be done by assigning such points to the nearest core point.

(Note that steps 4-8 are DBSCAN.)
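Since steps 4 through 8 are DBSCAN, one way to sketch the whole pipeline is to turn the SNN similarity into a distance and hand it to scikit-learn's DBSCAN with a precomputed metric; k, the Eps value (expressed in shared neighbors), and MinPts below are illustrative choices, not values taken from the slides.

# Sketch of SNN clustering: shared-nearest-neighbor similarity converted to a
# distance, then DBSCAN (steps 4-8) on that precomputed distance matrix.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

def snn_clustering(X, k=10, eps_shared=4, min_pts=5):
    _, idx = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    neighbors = [set(row[1:]) for row in idx]           # drop self
    n = len(X)
    sim = np.zeros((n, n))
    for i in range(n):
        for j in neighbors[i]:
            if i in neighbors[j]:                       # mutual k-NN link
                sim[i, j] = sim[j, i] = len(neighbors[i] & neighbors[j])
    dist = k - sim                                      # many shared neighbors = small distance
    np.fill_diagonal(dist, 0)
    # points whose SNN similarity is >= eps_shared count as Eps-neighbors here
    db = DBSCAN(eps=k - eps_shared, min_samples=min_pts,
                metric="precomputed").fit(dist)
    return db.labels_                                   # -1 marks unclustered / noise points

X, _ = make_blobs(n_samples=300, centers=3,
                  cluster_std=[0.5, 1.5, 0.8], random_state=0)
print(np.unique(snn_clustering(X), return_counts=True))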


    SNN Density

(Figure panels: a) all points, b) high SNN density, c) medium SNN density, d) low SNN density.)


    SNN Clustering Can Handle Differing Densities

(Figure: original points vs. SNN clustering.)


SNN Clustering Can Handle Other Difficult Situations


Finding Clusters of Time Series in Spatio-Temporal Data

(Figure: "26 SLP Clusters via Shared Nearest Neighbor Clustering (100 NN, 1982-1994)", numbered SNN clusters of sea-level pressure (SLP) plotted by longitude and latitude.)

(Figure: SNN density of the SLP time series data, shown as the SNN density of points on the globe.)


Features and Limitations of SNN Clustering

- Does not cluster all the points.
- The complexity of SNN clustering is high:
  - O(n * time to find the neighbors within Eps).
  - In the worst case, this is O(n^2).
  - For lower dimensions, there are more efficient ways to find the nearest neighbors: R* trees, k-d trees.
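For the low-dimensional case, a spatial index answers the Eps-neighborhood queries without an all-pairs O(n^2) scan; a minimal sketch using scikit-learn's KDTree, with an illustrative radius and random data.

# Sketch: find every point's Eps-neighborhood with a k-d tree instead of
# comparing all pairs of points.
import numpy as np
from sklearn.neighbors import KDTree

X = np.random.RandomState(0).rand(1000, 2)
tree = KDTree(X)
neighborhoods = tree.query_radius(X, r=0.05)    # one index array per point
print("average neighborhood size:",
      np.mean([len(nbrs) for nbrs in neighborhoods]))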