Clustering

COMP 790-90 Research Seminar

BCB 713 Module, Spring 2011

Wei Wang

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

Drawback of Distance-based Methods

Hard to find clusters with irregular shapes

Hard to specify the number of clusters

Heuristic: a cluster must be dense

COMP 790-090 Data Mining: Concepts, Algorithms, and Applications

Directly Density Reachable

Parameters

Eps: Maximum radius of the neighborhood

MinPts: Minimum number of points in an Eps-neighborhood of that point

N_Eps(p): {q | dist(p, q) ≤ Eps}

Core object p: |N_Eps(p)| ≥ MinPts

Point q is directly density-reachable from p iff q ∈ N_Eps(p) and p is a core object

(Figure: points p and q, with MinPts = 3 and Eps = 1 cm)
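The definitions above can be written down almost word for word. The following is a minimal sketch, assuming points are tuples of coordinates and dist is the Euclidean distance; the function names are illustrative and do not come from the slides.

```python
import math

def eps_neighborhood(points, p, eps):
    """N_Eps(p): every point q with dist(p, q) <= eps (p itself is included)."""
    return [q for q in points if math.dist(p, q) <= eps]

def is_core(points, p, eps, min_pts):
    """p is a core object when its Eps-neighborhood holds at least min_pts points."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts

def directly_density_reachable(points, q, p, eps, min_pts):
    """q is directly density-reachable from p iff q is in N_Eps(p) and p is core."""
    return math.dist(p, q) <= eps and is_core(points, p, eps, min_pts)

# Three points close together and one isolated point.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (3.0, 3.0)]
```

With Eps = 1.0 and MinPts = 3, (0.5, 0.0) is directly density-reachable from the core object (0.0, 0.0), while nothing is directly density-reachable from the isolated point (3.0, 3.0) because it is not a core object. The relation is therefore not symmetric in general.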


Density-Based Clustering: Background (II)

Density-reachable

If each step p1 → p2, p2 → p3, …, pn-1 → pn is directly density-reachable, then pn is density-reachable from p1

Density-connected

If points p and q are both density-reachable from o, then p and q are density-connected

(Figures: a chain from p1 to q through p; points p and q both reached from o)

DBSCAN

A cluster: a maximal set of density-connected points

Discover clusters of arbitrary shape in spatial databases with noise

(Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5)

DBSCAN: the Algorithm

Arbitrarily select a point p

Retrieve all points density-reachable from p wrt Eps and MinPts

If p is a core point, a cluster is formed

If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database

Continue the process until all of the points have been processed
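The steps of the algorithm can be sketched as follows. This is a compact illustration, not a tuned implementation: it assumes Euclidean distance, finds neighbors by a linear scan instead of a spatial index, and uses the label -1 for noise. All names are illustrative.

```python
import math
from collections import deque

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    def neighbors(i):
        # Linear scan; a spatial index would replace this in practice.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)        # None means "not yet processed"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # not core; may later become a border point
            continue
        labels[i] = cluster_id           # i is core, so a new cluster is formed
        queue = deque(seeds)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:   # j is core: keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels
```

For two well-separated unit squares of four points each plus one distant point, calling dbscan with eps = 1.5 and min_pts = 3 labels the squares 0 and 1 and the distant point as noise.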

Problems of DBSCAN

Different clusters may have very different densities

Clusters may be in hierarchies


OPTICS: A Cluster-ordering Method

OPTICS: ordering points to identify the clustering structure

“Group” points by density connectivity

Hierarchies of clusters

Visualize clusters and the hierarchy


DENCLUE: Using Density Functions

DENsity-based CLUstEring

Major features

Solid mathematical foundation

Good for data sets with large amounts of noise

Allow a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets

Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)

But need a large number of parameters

Grid-based Clustering Methods

Ideas

Using multi-resolution grid data structures

Use dense grid cells to form clusters

Several interesting methods

STING

WaveCluster

CLIQUE


STING: A Statistical Information Grid Approach

The spatial area is divided into rectangular cells

There are several levels of cells corresponding to different levels of resolution


STING: A Statistical Information Grid Approach (2)

Each cell at a high level is partitioned into a number of smaller cells in the next lower level

Statistical information of each cell is calculated and stored beforehand and is used to answer queries

Parameters of higher level cells can be easily calculated from parameters of lower level cells

count, mean, s, min, max

type of distribution—normal, uniform, etc.
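As an illustration of how a higher level cell can obtain its parameters from its lower level cells, the sketch below merges count, mean, s, min, and max without revisiting the raw data. It assumes s is the population standard deviation; the class and function names are hypothetical, and the distribution type is left out.

```python
import math
from dataclasses import dataclass

@dataclass
class CellStats:
    count: int
    mean: float
    s: float          # standard deviation (population form assumed)
    min_value: float
    max_value: float

def leaf_stats(values):
    """Statistics of a non-empty bottom-level cell, computed from its raw values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return CellStats(n, mean, math.sqrt(var), min(values), max(values))

def merge(children):
    """Combine child-cell statistics into the parent cell's statistics."""
    children = [c for c in children if c.count > 0]
    n = sum(c.count for c in children)
    mean = sum(c.count * c.mean for c in children) / n
    # E[x^2] of the parent is the count-weighted average of each child's E[x^2].
    second_moment = sum(c.count * (c.s ** 2 + c.mean ** 2) for c in children) / n
    var = max(second_moment - mean ** 2, 0.0)   # guard against rounding below zero
    return CellStats(n, mean, math.sqrt(var),
                     min(c.min_value for c in children),
                     max(c.max_value for c in children))
```

Merging the statistics of two child cells gives the same result as computing the statistics over the union of their values, which is what lets the hierarchy be built bottom-up in one pass.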

Use a top-down approach to answer spatial data queries

Start from a pre-selected layer, typically with a small number of cells

For each cell in the current level compute the confidence interval

STING: A Statistical Information Grid Approach (3)

Remove the irrelevant cells from further consideration

When finished examining the current layer, proceed to the next lower level

Repeat this process until the bottom layer is reached


STING: A Statistical Information Grid Approach (4)

Advantages:

Query-independent, easy to parallelize, incremental update

O(K), where K is the number of grid cells at the lowest level

Disadvantages:

All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected


WaveCluster

A multi-resolution clustering approach which applies wavelet transform to the feature space

A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub-bands.

Both grid-based and density-based

Input parameters:

# of grid cells for each dimension

the wavelet, and the # of applications of wavelet transform.

WaveCluster

How to apply wavelet transform to find clusters

Summarizes the data by imposing a multidimensional grid structure onto data space

These multidimensional spatial data objects are represented in an n-dimensional feature space

Apply wavelet transform on feature space to find the dense regions in the feature space

Apply wavelet transform multiple times, which results in clusters at different scales from fine to coarse
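A rough sketch of these steps in two dimensions follows. It is a simplification under stated assumptions, not the published WaveCluster procedure: the wavelet is replaced by the unnormalized Haar approximation band (the average of each 2 x 2 block), the grid is square with an even number of cells per side, and the density threshold is chosen by hand. All function names are illustrative.

```python
from collections import deque

def quantize(points, cells, lo, hi):
    """Count points per cell of a cells x cells grid over [lo, hi] in both dimensions."""
    grid = [[0] * cells for _ in range(cells)]
    width = (hi - lo) / cells
    for x, y in points:
        i = min(int((x - lo) / width), cells - 1)
        j = min(int((y - lo) / width), cells - 1)
        grid[i][j] += 1
    return grid

def haar_lowpass(grid):
    """One level of smoothing: average each 2 x 2 block into one coarser cell."""
    half = len(grid) // 2
    return [[(grid[2 * i][2 * j] + grid[2 * i][2 * j + 1] +
              grid[2 * i + 1][2 * j] + grid[2 * i + 1][2 * j + 1]) / 4.0
             for j in range(half)] for i in range(half)]

def connected_dense_cells(grid, threshold):
    """Group cells whose value is >= threshold into 4-connected components."""
    n = len(grid)
    seen, clusters = set(), []
    for start in ((i, j) for i in range(n) for j in range(n)):
        if start in seen or grid[start[0]][start[1]] < threshold:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            i, j = queue.popleft()
            component.append((i, j))
            for a, b in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                if (0 <= a < n and 0 <= b < n and (a, b) not in seen
                        and grid[a][b] >= threshold):
                    seen.add((a, b))
                    queue.append((a, b))
        clusters.append(sorted(component))
    return clusters
```

Calling haar_lowpass again on its own output gives the next coarser scale, which mirrors the idea of applying the transform multiple times. Isolated points are averaged down below the threshold, so they drop out without a separate noise-removal step.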

Wavelet Transform

Decomposes a signal into different frequency subbands (can be applied to n-dimensional signals)

Data are transformed to preserve relative distance between objects at different levels of resolution.

Allows natural clusters to become more distinguishable


What Is Wavelet (2)?

Quantization

Transformation


WaveCluster

Why is wavelet transformation useful for clustering

Unsupervised clustering

It uses hat-shape filters to emphasize regions where points cluster, but simultaneously to suppress weaker information in their boundary

Effective removal of outliers

Multi-resolution

Cost efficiency




Major features:

Complexity O(N)

Detect arbitrary shaped clusters at different scales

Not sensitive to noise, not sensitive to input order

Only applicable to low dimensional data


CLIQUE (Clustering In QUEst)

Automatically identifying subspaces of a high dimensional data space that allow better clustering than original space

CLIQUE can be considered as both density-based and grid-based

It partitions each dimension into the same number of equal length intervals

It partitions an m-dimensional data space into non-overlapping rectangular units

A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter

A cluster is a maximal set of connected dense units within a subspace


CLIQUE: The Major Steps

Partition the data space and find the number of points that lie inside each cell of the partition.

Identify the subspaces that contain clusters using the Apriori principle

Identify clusters:

Determine dense units in all subspaces of interest

Determine connected dense units in all subspaces of interest

Generate minimal description for the clusters

Determine maximal regions that cover a cluster of connected dense units for each cluster

Determination of minimal cover for each cluster
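The first two steps can be sketched for one- and two-dimensional subspaces as below. This is a simplified illustration: it assumes every dimension shares the range [lo, hi], uses xi for the number of intervals per dimension and tau for the density threshold (both names are assumptions), and stops at two dimensions instead of iterating to higher ones.

```python
from collections import Counter
from itertools import combinations

def interval(value, lo, hi, xi):
    """Index of the equal-length interval (out of xi) that contains value."""
    return min(int((value - lo) / (hi - lo) * xi), xi - 1)

def dense_units(points, lo, hi, xi, tau):
    """Return dense 1-D units {(dim, idx)} and dense 2-D units {((d1, i1), (d2, i2))}."""
    n, dims = len(points), len(points[0])
    cells = [[interval(p[d], lo, hi, xi) for d in range(dims)] for p in points]

    # A unit is dense if the fraction of all points it contains exceeds tau.
    counts1 = Counter((d, c[d]) for c in cells for d in range(dims))
    dense1 = {u for u, k in counts1.items() if k / n > tau}

    # Apriori pruning: a 2-D unit can only be dense if both 1-D projections are,
    # so only those candidates are counted.
    counts2 = Counter()
    for c in cells:
        for d1, d2 in combinations(range(dims), 2):
            u1, u2 = (d1, c[d1]), (d2, c[d2])
            if u1 in dense1 and u2 in dense1:
                counts2[(u1, u2)] += 1
    dense2 = {u for u, k in counts2.items() if k / n > tau}
    return dense1, dense2
```

The pruning matters because the number of candidate units grows quickly with dimensionality; counting only combinations of already-dense lower-dimensional units keeps the search tractable.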

CLIQUE

(Figure: two grid plots over age, 20 to 60: Salary (10,000) versus age and Vacation (week) versus age, each with a vertical axis from 0 to 7)


CLIQUE

(Figure: Vacation versus age, with the age range 30 to 50 marked and a parameter set to 3)


Strength and Weakness of CLIQUE

Strength

It automatically finds subspaces of the highest dimensionality such that high density clusters exist in those subspaces

It is insensitive to the order of records in input and does not presume some canonical data distribution

It scales linearly with the size of input and has good scalability as the number of dimensions in the data increases

Weakness

The accuracy of the clustering result may be degraded at the expense of simplicity of the method

Constrained Clustering

Constraints exist in data space or in user queries

Example: ATM allocation with bridges and highways

People can cross a highway by a bridge


Clustering With Obstacle Objects


(Figures: clustering result taking obstacles into account; clustering result not taking obstacles into account)

Outlier Analysis

“One person’s noise is another person’s signal”

Outliers: the objects considerably dissimilar from the remainder of the data

Examples: credit card fraud, Michael Jordan, etc.

Applications: credit card fraud detection, telecom fraud detection, customer segmentation, medical analysis, etc.

Statistical Outlier Analysis

Discordancy/outlier tests

100+ tests proposed

Data distribution

Distribution parameters

The number of outliers

The types of expected outliers

Example: upper or lower outliers in an ordered sample

Drawbacks of Statistical Approaches

Most tests are univariate

Unsuitable for multidimensional datasets

All are distribution-based

Unknown distributions in many applications


Depth-based Methods

Organize data objects in layers with various depths

The shallow layers are more likely to contain outliers

Example: Peeling, Depth contours

Complexity O(N^(k/2)) for k-d datasets

Unacceptable for k>2
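Peeling can be illustrated in two dimensions by repeatedly removing the convex hull of the remaining points. The sketch below is one plausible reading of the idea, using Andrew's monotone chain for the hull; it assumes 2-D points given as tuples and treats points lying on a hull edge (but not at a vertex) as belonging to a deeper layer.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive for a counter-clockwise turn."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def convex_hull(points):
    """Andrew's monotone chain; returns the hull vertices of a set of 2-D points."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]

def peeling_depths(points):
    """Depth 1 = outermost hull, depth 2 = hull of what remains, and so on."""
    remaining, depth, depths = set(points), 1, {}
    while remaining:
        for p in convex_hull(list(remaining)):
            depths[p] = depth
        remaining.difference_update(depths)   # drop every point already assigned
        depth += 1
    return depths
```

Points with depth 1 form the shallowest layer and are the first outlier candidates. For two nested squares around a center point, the outer corners get depth 1, the inner corners depth 2, and the center depth 3.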


Distance-based Outliers

A DB(p, D)-outlier is an object O in a dataset T s.t. at least fraction p of the objects in T lies at a distance greater than distance D from O

Algorithms for mining distance-based outliers

The index-based algorithm, the nested-loop algorithm, the cell-based algorithm

Index-based Algorithms

Find DB(p, D) outliers in T with n objects

Find objects having at most n(1-p) neighbors within radius D

Algorithm

Build a standard multidimensional index

Search every object O with radius D

If there are more than n(1-p) neighbors, O is not an outlier

Else, output O

Pros and Cons of Index-based Algorithms

Complexity of search O(kN^2)

More scalable with dimensionality than depth-based approaches

Building a right index is very costly

Index building cost renders the index-based algorithms non-competitive


A Naïve Nested-loop Algorithm

For j=1 to n do
  Set count_j=0;
  For k=1 to n do if (dist(j,k)<D) then count_j++;
  If count_j <= n(1-p) then output j as an outlier;

No explicit index construction

O(N^2)

Many database scans


Optimizations of Nested-loop Algorithm

Once an object has more than n(1-p) neighbors within radius D, no need to count further

Use the data in main memory as much as possible

Reduce the number of database scans
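The first optimization is easy to show in code. The sketch below is an in-memory illustration only, so it says nothing about database scans. It assumes Euclidean distance, counts an object as its own neighbor (as a loop over all k does), and treats "within radius D" as dist <= D, the complement of "greater than D" in the DB(p, D) definition. The function name is illustrative.

```python
import math

def db_outliers(points, p, d):
    """Indices of DB(p, d)-outliers found by a nested loop with early exit."""
    n = len(points)
    max_neighbors = n * (1 - p)      # an outlier has at most this many neighbors
    outliers = []
    for j in range(n):
        count = 0
        for k in range(n):
            if math.dist(points[j], points[k]) <= d:
                count += 1
                if count > max_neighbors:   # j cannot be an outlier any more
                    break
        else:
            outliers.append(j)       # inner loop finished without a break
    return outliers
```

For five points packed into a unit square and one point far away, with p = 0.7 and d = 2, only the distant point is reported. The packed points leave the inner loop after their second neighbor, which is where the saving over the naive count comes from.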


References (1)

R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.

M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.

M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. SIGMOD'99.

P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.

M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.

M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.

D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.

D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. In Proc. VLDB'98.

S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. SIGMOD'98.

A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.

References (2)

L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990.

E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.

G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.

P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.

R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.

E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.

G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.

W. Wang, J. Yang, R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. VLDB'97.

T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases. SIGMOD'96.
