Clustering
COMP 790-90 Research Seminar
BCB 713 Module
Spring 2011
Wei Wang
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
Drawback of Distance-based Methods
Hard to find clusters with irregular shapes
Hard to specify the number of clusters
Heuristic: a cluster must be dense
COMP 790-090 Data Mining: Concepts, Algorithms, and Applications
Directly Density Reachable
Parameters
  Eps: maximum radius of the neighborhood
  MinPts: minimum number of points in an Eps-neighborhood of that point
N_Eps(p) = {q | dist(p, q) ≤ Eps}
Core object p: |N_Eps(p)| ≥ MinPts
Point q is directly density-reachable from p iff q ∈ N_Eps(p) and p is a core object
[Figure: points p and q, with MinPts = 3 and Eps = 1 cm]
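The definitions above can be checked on a handful of points. The following is a minimal sketch, assuming Euclidean distance and 2-D tuples; the function names and the sample data are illustrative and not part of the original slides.

```python
from math import dist  # Euclidean distance, Python 3.8+


def eps_neighborhood(points, p, eps):
    """N_Eps(p): every point within distance eps of p (p itself included)."""
    return [q for q in points if dist(p, q) <= eps]


def is_core(points, p, eps, min_pts):
    """p is a core object if its Eps-neighborhood holds at least MinPts points."""
    return len(eps_neighborhood(points, p, eps)) >= min_pts


def directly_density_reachable(points, q, p, eps, min_pts):
    """q is directly density-reachable from p iff q is in N_Eps(p) and p is core."""
    return is_core(points, p, eps, min_pts) and dist(p, q) <= eps


# Hypothetical sample data: a small dense group, one point on its edge, one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (1.4, 0.0), (5.0, 5.0)]
eps, min_pts = 1.0, 3

core_point = (0.5, 0.0)    # has 4 points in its neighborhood
border_point = (1.4, 0.0)  # has only 2 points in its neighborhood

forward = directly_density_reachable(points, border_point, core_point, eps, min_pts)
backward = directly_density_reachable(points, core_point, border_point, eps, min_pts)
```

Here `forward` is true and `backward` is false, which illustrates that direct density-reachability is not symmetric: a border point can be reached from a core point, but not the other way round.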
Density-Based Clustering: Background (II)
Density-reachable
  If p2 is directly density-reachable from p1, p3 from p2, …, and pn from pn-1, then pn is density-reachable from p1
Density-connected
  If points p and q are both density-reachable from o, then p and q are density-connected
[Figure: points p, p1, q (density-reachable) and p, q, o (density-connected)]
DBSCAN
A cluster: a maximal set of density-connected points
Discover clusters of arbitrary shape in spatial databases with noise
[Figure: core, border, and outlier points, with Eps = 1 cm and MinPts = 5]
DBSCAN: the Algorithm
Arbitrarily select a point p
Retrieve all points density-reachable from p wrt Eps and MinPts
If p is a core point, a cluster is formed
If p is a border point, no points are density-reachable from p and DBSCAN visits the next point of the database
Continue the process until all of the points have been processed
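The steps above can be sketched in a few lines. This is a minimal illustration, assuming Euclidean distance and a brute-force neighborhood search with no spatial index; the names `dbscan`, `region_query` and `NOISE` and the sample data are assumptions for this example, not the authors' implementation.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # label for points that end up in no cluster


def region_query(points, i, eps):
    """Indices of all points in the Eps-neighborhood of points[i]."""
    return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)  # None means "not visited yet"
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not core; may later turn out to be a border point
            continue
        # points[i] is a core point: grow a cluster from everything reachable from it
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # only core points extend the cluster
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels


# Hypothetical data: two square groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
labels = dbscan(points, eps=1.5, min_pts=3)
```

On this data the first four points share one label, the next four share another, and the isolated point is labeled `NOISE`.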
Problems of DBSCAN
Different clusters may have very different densities
Clusters may be in hierarchies
OPTICS: A Cluster-ordering Method
OPTICS: ordering points to identify the clustering structure
“Group” points by density connectivity
  Hierarchies of clusters
Visualize clusters and the hierarchy
DENCLUE: Using Density Functions
DENsity-based CLUstEring
Major features
  Solid mathematical foundation
  Good for data sets with large amounts of noise
  Allow a compact mathematical description of arbitrarily shaped clusters in high-dimensional data sets
  Significantly faster than existing algorithms (faster than DBSCAN by a factor of up to 45)
  But need a large number of parameters
Grid-based Clustering Methods
Ideas
  Using multi-resolution grid data structures
  Use dense grid cells to form clusters
Several interesting methods
  STING
  WaveCluster
  CLIQUE
STING: A Statistical Information Grid Approach
The spatial area is divided into rectangular cells
There are several levels of cells corresponding to different levels of resolution
STING: A Statistical Information Grid Approach (2)
Each cell at a high level is partitioned into a number of smaller cells in the next lower level
Statistical information of each cell is calculated and stored beforehand and is used to answer queries
Parameters of higher level cells can be easily calculated from parameters of lower level cells
  count, mean, s (standard deviation), min, max
  type of distribution: normal, uniform, etc.
Use a top-down approach to answer spatial data queries
  Start from a pre-selected layer, typically with a small number of cells
  For each cell in the current level compute the confidence interval
STING: A Statistical Information Grid Approach (3)
Remove the irrelevant cells from further consideration
When finished examining the current layer, proceed to the next lower level
Repeat this process until the bottom layer is reached
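The claim that the parameters of a higher-level cell follow from those of its lower-level cells can be made concrete. The sketch below is an illustration under stated assumptions: `s` is taken to be the population standard deviation, the distribution type is left out, and the names `CellStats`, `leaf_stats` and `merge_children` are hypothetical.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class CellStats:
    n: int       # count
    mean: float  # mean
    s: float     # standard deviation (population form, an assumption here)
    lo: float    # min
    hi: float    # max


def leaf_stats(values):
    """Statistics of a non-empty bottom-level cell, computed from its raw values."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return CellStats(n, mean, sqrt(var), min(values), max(values))


def merge_children(children):
    """Statistics of a parent cell, computed only from its children's statistics."""
    children = [c for c in children if c.n > 0]
    n = sum(c.n for c in children)
    mean = sum(c.n * c.mean for c in children) / n
    # E[x^2] of the parent is the count-weighted average of each child's s^2 + mean^2
    second_moment = sum(c.n * (c.s ** 2 + c.mean ** 2) for c in children) / n
    s = sqrt(max(second_moment - mean ** 2, 0.0))  # guard against tiny negatives
    return CellStats(n, mean, s, min(c.lo for c in children), max(c.hi for c in children))


# Hypothetical attribute values falling into two child cells.
left = leaf_stats([1.0, 2.0, 3.0])
right = leaf_stats([10.0, 20.0])
parent = merge_children([left, right])
direct = leaf_stats([1.0, 2.0, 3.0, 10.0, 20.0])
```

Up to floating-point rounding, `parent` matches `direct`, so the raw data never has to be revisited once the bottom-level statistics are stored.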
STING: A Statistical Information Grid Approach (4)
Advantages:
  Query-independent, easy to parallelize, incremental update
  O(K), where K is the number of grid cells at the lowest level
Disadvantages:
  All the cluster boundaries are either horizontal or vertical, and no diagonal boundary is detected
WaveCluster
A multi-resolution clustering approach which applies wavelet transform to the feature space
  A wavelet transform is a signal processing technique that decomposes a signal into different frequency sub-bands.
Both grid-based and density-based
Input parameters:
  # of grid cells for each dimension
  the wavelet, and the # of applications of wavelet transform.
WaveCluster
How to apply wavelet transform to find clusters
  Summarize the data by imposing a multidimensional grid structure onto data space
  These multidimensional spatial data objects are represented in an n-dimensional feature space
  Apply wavelet transform on feature space to find the dense regions in the feature space
  Apply wavelet transform multiple times, which results in clusters at different scales from fine to coarse
Wavelet Transform
Decomposes a signal into different frequency subbands (can be applied to n-dimensional signals)
Data are transformed to preserve relative distance between objects at different levels of resolution.
Allows natural clusters to become more distinguishable
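To make the subband decomposition concrete, here is a small sketch of one level of a 1-D transform. The slides do not name a specific wavelet; the unnormalized Haar wavelet is assumed here only because it is the simplest one, and the function names and the sample signal are illustrative.

```python
def haar_step(signal):
    """One level of an unnormalized Haar transform on an even-length signal.

    Returns (approx, detail): pairwise averages form the low-frequency
    subband, pairwise half-differences form the high-frequency subband.
    """
    approx = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
    return approx, detail


def haar_inverse(approx, detail):
    """Rebuild the original signal from its two subbands."""
    signal = []
    for a, d in zip(approx, detail):
        signal.extend([a + d, a - d])
    return signal


# Hypothetical point counts per grid cell along one dimension.
counts = [0, 1, 8, 9, 7, 0, 1, 0]
approx, detail = haar_step(counts)   # half resolution
coarser, _ = haar_step(approx)       # applying the transform again: coarser still
restored = haar_inverse(approx, detail)
```

The low-frequency subband `approx` is a smoothed, half-resolution view in which the dense stretch still stands out, while `detail` carries the local fluctuations. Repeating the step on `approx` gives the coarser scales.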
What Is Wavelet (2)?
Quantization
Transformation
WaveCluster
Why is wavelet transformation useful for clustering
  Unsupervised clustering
    It uses hat-shape filters to emphasize regions where points cluster, but simultaneously to suppress weaker information in their boundary
  Effective removal of outliers
  Multi-resolution
  Cost efficiency
WaveCluster
Major features:
  Complexity O(N)
  Detect arbitrary shaped clusters at different scales
  Not sensitive to noise, not sensitive to input order
  Only applicable to low dimensional data
CLIQUE (Clustering In QUEst)
Automatically identifying subspaces of a high dimensional data space that allow better clustering than original space
CLIQUE can be considered as both density-based and grid-based
  It partitions each dimension into the same number of equal length intervals
  It partitions an m-dimensional data space into non-overlapping rectangular units
  A unit is dense if the fraction of total data points contained in the unit exceeds the input model parameter
  A cluster is a maximal set of connected dense units within a subspace
CLIQUE: The Major Steps
Partition the data space and find the number of points that lie inside each cell of the partition.
Identify the subspaces that contain clusters using the Apriori principle
Identify clusters:
  Determine dense units in all subspaces of interest
  Determine connected dense units in all subspaces of interest
Generate minimal description for the clusters
  Determine maximal regions that cover a cluster of connected dense units for each cluster
  Determination of minimal cover for each cluster
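The first two steps can be sketched as follows. This is a simplified illustration, not the full method: it stops at 2-D subspaces and skips the connected-unit and minimal-description steps. The parameter names `xi` (intervals per dimension) and `tau` (density threshold), the function names and the sample data are assumptions for this example.

```python
from collections import Counter
from itertools import combinations


def unit_of(point, dims, lo, hi, xi):
    """Grid unit of `point` in subspace `dims`: one interval index per dimension."""
    unit = []
    for d in dims:
        idx = int((point[d] - lo[d]) / (hi[d] - lo[d]) * xi)
        unit.append(min(idx, xi - 1))  # the upper boundary falls in the last interval
    return tuple(unit)


def dense_units_1d(points, d, lo, hi, xi, tau):
    """Dense units of the 1-D subspace spanned by dimension d."""
    counts = Counter(unit_of(p, (d,), lo, hi, xi) for p in points)
    return {u for u, c in counts.items() if c / len(points) > tau}


def clique_dense_2d(points, lo, hi, xi, tau):
    """Dense units in every 1-D subspace, then in 2-D subspaces via Apriori pruning."""
    m = len(lo)
    dense_1d = {d: dense_units_1d(points, d, lo, hi, xi, tau) for d in range(m)}
    dense_2d = {}
    for a, b in combinations(range(m), 2):
        # Apriori: a 2-D unit can be dense only if both of its 1-D projections are
        candidates = {(ua[0], ub[0]) for ua in dense_1d[a] for ub in dense_1d[b]}
        units = (unit_of(p, (a, b), lo, hi, xi) for p in points)
        counts = Counter(u for u in units if u in candidates)
        found = {u for u, c in counts.items() if c / len(points) > tau}
        if found:
            dense_2d[(a, b)] = found
    return dense_1d, dense_2d


# Hypothetical 2-D data on [0, 10] x [0, 10]: two tight groups plus scattered points.
points = [(1, 1), (1.5, 1.2), (1.2, 1.8), (0.5, 0.5),
          (9, 9), (8.5, 9.5), (9.2, 8.1),
          (5, 1), (1, 5), (5, 9)]
dense_1d, dense_2d = clique_dense_2d(points, lo=(0, 0), hi=(10, 10), xi=5, tau=0.25)
```

With these settings, intervals 0 and 4 are dense in each dimension, giving four candidate 2-D units; only the two units that hold the tight groups pass the density test.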
CLIQUE
[Figure: two grids over age (20 to 60): Salary (10,000), scale 0 to 7, versus age, and Vacation (week), scale 0 to 7, versus age]
CLIQUE
[Figure: Vacation versus age, with age 30 to 50 marked; the figure also carries a label ending in "= 3"]
Strength and Weakness of CLIQUE
Strength
  It automatically finds subspaces of the highest dimensionality such that high density clusters exist in those subspaces
  It is insensitive to the order of records in input and does not presume some canonical data distribution
  It scales linearly with the size of input and has good scalability as the number of dimensions in the data increases
Weakness
  The accuracy of the clustering result may be degraded at the expense of simplicity of the method
Constrained Clustering
Constraints exist in data space or in user queries
Example: ATM allocation with bridges and highways
People can cross a highway by a bridge
Clustering With Obstacle Objects
[Figure: two panels, "Taking obstacles into account" and "Not taking obstacles into account"]
Outlier Analysis
“One person’s noise is another person’s signal”
Outliers: the objects considerably dissimilar from the remainder of the data
  Examples: credit card fraud, Michael Jordan, etc.
  Applications: credit card fraud detection, telecom fraud detection, customer segmentation, medical analysis, etc.
Statistical Outlier Analysis
Discordancy/outlier tests
  100+ tests proposed
Data distribution
  Distribution parameters
The number of outliers
The types of expected outliers
  Example: upper or lower outliers in an ordered sample
Drawbacks of Statistical Approaches
Most tests are univariate
  Unsuitable for multidimensional datasets
All are distribution-based
  Unknown distributions in many applications
Depth-based Methods
Organize data objects in layers with various depths
  The shallow layers are more likely to contain outliers
Example: Peeling, Depth contours
Complexity O(N^(k/2)) for k-d datasets
  Unacceptable for k>2
Distance-based Outliers
A DB(p, D)-outlier is an object O in a dataset T s.t. at least fraction p of the objects in T lie at a distance greater than D from O
Algorithms for mining distance-based outliers
  The index-based algorithm, the nested-loop algorithm, the cell-based algorithm
Index-based Algorithms
Find DB(p, D) outliers in T with n objects
  Find objects having at most n(1-p) neighbors within radius D
Algorithm
  Build a standard multidimensional index
  Search every object O with radius D
    If more than n(1-p) neighbors are found, O is not an outlier
    Else, output O
Pros and Cons of Index-based Algorithms
Complexity of search O(kN^2)
  More scalable with dimensionality than depth-based approaches
Building a right index is very costly
  Index building cost renders the index-based algorithms non-competitive
A Naïve Nested-loop Algorithm
For j=1 to n do
  Set count_j=0;
  For k=1 to n do if (dist(j,k)<D) then count_j++;
  If count_j <= n(1-p) then output j as an outlier;
No explicit index construction
  O(N^2)
Many database scans
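The pseudocode above translates almost line for line into runnable code. This is a minimal in-memory sketch, assuming Euclidean distance; as in the pseudocode, an object is counted among its own neighbors. The function name and the sample data are illustrative.

```python
from math import dist  # Euclidean distance, Python 3.8+


def db_outliers_nested_loop(points, p, D):
    """Indices of the DB(p, D)-outliers, found by comparing every pair of objects."""
    n = len(points)
    max_neighbors = n * (1 - p)  # an outlier has at most this many neighbors
    outliers = []
    for j in range(n):
        count = 0
        for k in range(n):
            if dist(points[j], points[k]) < D:
                count += 1  # k == j is counted too, as in the pseudocode
        if count <= max_neighbors:
            outliers.append(j)
    return outliers


# Hypothetical data: nine nearby points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (0.2, 0.8), (0.8, 0.2), (0.3, 0.3), (0.7, 0.7),
          (10, 10)]
outliers = db_outliers_nested_loop(points, p=0.75, D=2.0)
```

Only the distant point (index 9) is reported: it has a single neighbor within D (itself), which is below the limit of n(1-p) = 2.5.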
Optimizations of Nested-loop Algorithm
Once an object has more than n(1-p) neighbors within radius D, no need to count further
Use the data in main memory as much as possible
  Reduce the number of database scans
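The first optimization, stopping the count early, can be sketched as below. This covers only the early-termination idea for data already in memory; it does not model database scans or block-wise loading. The function name, the comparison counter and the sample data are assumptions for this example.

```python
from math import dist  # Euclidean distance, Python 3.8+


def db_outliers_early_stop(points, p, D):
    """Nested-loop search that stops counting once an object cannot be an outlier.

    Returns (outlier indices, number of distance computations performed).
    """
    n = len(points)
    max_neighbors = n * (1 - p)
    outliers = []
    comparisons = 0
    for j in range(n):
        count = 0
        for k in range(n):
            comparisons += 1
            if dist(points[j], points[k]) < D:
                count += 1
                if count > max_neighbors:
                    break  # j already has too many neighbors to be an outlier
        else:
            # the inner loop finished without a break: count <= max_neighbors
            outliers.append(j)
    return outliers, comparisons


# Hypothetical data: nine nearby points and one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5),
          (0.2, 0.8), (0.8, 0.2), (0.3, 0.3), (0.7, 0.7),
          (10, 10)]
outliers, comparisons = db_outliers_early_stop(points, p=0.75, D=2.0)
```

The result is the same as a full pairwise count, but objects in dense regions are dismissed after a few comparisons, so far fewer than n^2 distances are computed on this data.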
References (1)
R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. SIGMOD'98.
M. R. Anderberg. Cluster Analysis for Applications. Academic Press, 1973.
M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander. Optics: Ordering points to identify the clustering structure. SIGMOD'99.
P. Arabie, L. J. Hubert, and G. De Soete. Clustering and Classification. World Scientific, 1996.
M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases. KDD'96.
M. Ester, H.-P. Kriegel, and X. Xu. Knowledge discovery in large spatial databases: Focusing techniques for efficient class identification. SSD'95.
D. Fisher. Knowledge acquisition via incremental conceptual clustering. Machine Learning, 2:139-172, 1987.
D. Gibson, J. Kleinberg, and P. Raghavan. Clustering categorical data: An approach based on dynamic systems. In Proc. VLDB'98.
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. SIGMOD'98.
A. K. Jain and R. C. Dubes. Algorithms for Clustering Data. Prentice Hall, 1988.
References (2)
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990.
E. Knorr and R. Ng. Algorithms for mining distance-based outliers in large datasets. VLDB'98.
G. J. McLachlan and K. E. Basford. Mixture Models: Inference and Applications to Clustering. John Wiley and Sons, 1988.
P. Michaud. Clustering techniques. Future Generation Computer Systems, 13, 1997.
R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. VLDB'94.
E. Schikuta. Grid clustering: An efficient hierarchical clustering method for very large data sets. Proc. 1996 Int. Conf. on Pattern Recognition, 101-105.
G. Sheikholeslami, S. Chatterjee, and A. Zhang. WaveCluster: A multi-resolution clustering approach for very large spatial databases. VLDB'98.
W. Wang, J. Yang, R. Muntz. STING: A Statistical Information Grid Approach to Spatial Data Mining. VLDB'97.
T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an efficient data clustering method for very large databases. SIGMOD'96.