Cluster Analysis (Clustering)
©Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca
Chapter 7. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
General Applications of Clustering
Business/economic science (especially marketing research)
Biomedical informatics
Pattern recognition
Spatial data analysis
   detect spatial clusters and explain them in spatial data mining
WWW
   document profiling
   clustering log data to discover groups of similar access patterns

Domain knowledge is very important to the applications of clustering!
Specific Examples of Clustering Applications
Marketing: Help marketers discover distinct group profiles of customers, and then use this knowledge to develop targeted marketing programs
Insurance: Identifying distinct group profiles of motor insurance policy holders with a high average claim cost
Biology: Derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into the structures inherent in populations
What Is Good Clustering?
A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity.
The quality of a clustering result depends on both the similarity measure and the clustering method.
The quality of a clustering method is measured by its ability to discover some or all of the hidden patterns.
Issues of Clustering in Data Mining
Scalability (important for big data analysis)
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
Ability to deal with high dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Chapter 7. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
Data Structures in Clustering
Two data structures:

Data matrix (n objects × p variables):

   X = [ x_11  ...  x_1f  ...  x_1p
         ...
         x_i1  ...  x_if  ...  x_ip
         ...
         x_n1  ...  x_nf  ...  x_np ]

Dissimilarity matrix (n objects × n objects):

   D = [ 0
         d(2,1)   0
         d(3,1)   d(3,2)   0
         :        :        :
         d(n,1)   d(n,2)   ...   0 ]
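For illustration, a dissimilarity matrix can be computed from a data matrix in a few lines. A minimal Python sketch using SciPy (the sample points are invented):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n = 4 objects x p = 2 variables (made-up sample points)
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [8.0, 7.0],
              [9.0, 6.0]])

# Condensed pairwise distances, then the full n x n dissimilarity matrix.
# The result is symmetric with a zero diagonal, so only the lower triangle
# carries information, as in the schematic above.
D = squareform(pdist(X, metric='euclidean'))
print(np.round(D, 2))
```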
Similarity Measurement between Samples
Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
The definitions of distance functions are usually very different for interval-scaled (numeric), boolean, categorical, ordinal and ratio variables.
It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
Sometimes, weights should be associated with different variables based on applications and data semantics.
Type of data in clustering analysis
Interval-scaled variables: the data have order and equal intervals, measured on a linear scale; the intervals keep the same importance throughout the scale
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Data Normalization (Interval-Scaled Variables)

Data normalization (for each feature f):

1. Calculate the mean:

   m_f = (1/n)(x_1f + x_2f + ... + x_nf)

2. Calculate the mean absolute deviation:

   s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)

3. Calculate the standardized measurement (z-score, zero-mean normalization):

   z_if = (x_if - m_f) / s_f

Using the mean absolute deviation s_f is more robust to outliers, for
cluster analysis, than using the standard deviation

   s'_f = sqrt( (1/n)((x_1f - m_f)^2 + (x_2f - m_f)^2 + ... + (x_nf - m_f)^2) )
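A minimal Python sketch of these three steps (the feature values are invented, with 100.0 playing the outlier):

```python
import numpy as np

def standardize(column):
    """Zero-mean normalization using the mean absolute deviation s_f,
    which is more robust to outliers than the standard deviation."""
    m_f = column.mean()                    # step 1: the mean
    s_f = np.abs(column - m_f).mean()      # step 2: mean absolute deviation
    return (column - m_f) / s_f            # step 3: the z-scores

x_f = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
print(standardize(x_f))
```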
Similarity and Dissimilarity Between Data Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects x_i and x_j (if the data objects are viewed as points in a data space).
Some popular ones include the Minkowski distance:

   d(x_i, x_j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q )^(1/q)

where x_i = <x_i1, x_i2, ..., x_ip> and x_j = <x_j1, x_j2, ..., x_jp> are two
p-dimensional data points, and q is a positive integer.

If q = 1, d is the Manhattan distance:

   d(x_i, x_j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
Similarity and Dissimilarity Between Objects (Cont.)
If q = 2, d is the Euclidean distance:

   d(x_i, x_j) = sqrt( |x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2 )

Properties:
   d(i,j) >= 0
   d(i,i) = 0
   d(i,j) = d(j,i)
   d(i,j) <= d(i,k) + d(k,j)

One can also use a weighted distance, the parametric Pearson product-moment correlation (or simply correlation r), or other dissimilarity measures.
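A small sketch of the Minkowski family (the two points are invented):

```python
import numpy as np

def minkowski(xi, xj, q):
    """Minkowski distance of order q: q = 1 is Manhattan, q = 2 is Euclidean."""
    return np.sum(np.abs(xi - xj) ** q) ** (1.0 / q)

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 6.0, 3.0])
print(minkowski(xi, xj, 1))   # Manhattan: 3 + 4 + 0 = 7
print(minkowski(xi, xj, 2))   # Euclidean: sqrt(9 + 16 + 0) = 5
```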
Distance Functions for Binary Variables
A contingency table for binary data:

                     object j
                    1      0     sum
   object i   1     a      b     a+b
              0     c      d     c+d
            sum    a+c    b+d     p

a : the number of variables that equal 1 for both objects i and j
b : the number of variables that equal 1 for object i but 0 for object j
c : the number of variables that equal 0 for object i but 1 for object j
d : the number of variables that equal 0 for both objects i and j
Using the contingency table above:

Simple matching coefficient (if the binary variable is symmetric):

   d(i,j) = (b + c) / (a + b + c + d)

Jaccard coefficient (if the binary variable is asymmetric):

   d(i,j) = (b + c) / (a + b + c)

   sim(i,j) = 1 - d(i,j) = a / (a + b + c)
Dissimilarity between Binary Variables
Example:

   Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
   Jack     M        Y       N       P        N        N        N
   Mary     F        Y       N       P        N        P        N
   Jim      M        Y       P       N        N        N        N

Gender is a symmetric attribute and is ignored here. The remaining attributes are asymmetric binary. Let the values Y and P be set to 1, and the value N be set to 0.
Dissimilarity between Binary Variables (Asymmetric Example)

Preprocessing: ignore Gender; Y, P → 1; N → 0

   Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
   Jack     1       0       1        0        0        0
   Mary     1       0       1        0        1        0
   Jim      1       1       0        0        0        0

Using the Jaccard coefficient d(i,j) = (b + c) / (a + b + c):

   d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
   d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
   d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

Conclusion: the clinical conditions of Jack and Mary are the most similar.
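The numbers above can be reproduced with a few lines of Python; a sketch:

```python
def jaccard_dissimilarity(i, j):
    """d(i,j) = (b + c) / (a + b + c) for asymmetric binary vectors,
    ignoring the negative matches d."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

#       Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))   # 0.75
```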
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., color: red, yellow, blue, …
Dissimilarity between nominal variables:

Method 1: simple matching

   d(i,j) = (p - m) / p

   m : # of matches, p : total # of variables

Method 2: use a large number of asymmetric binary variables, creating a new binary variable for each of the M nominal states of a particular variable, e.g., <0, 0, 0, 0, 0, 0, 1> for the color purple

Method 3: use a small number of additive binary variables, e.g., <1, 0, 1> for purple (in a 3-bit RGB representation)
Dissimilarity between Ordinal Variables
An ordinal variable can be discrete or continuous. The values are ordered in a meaningful sequence, e.g., job ranks. It can be treated like an interval-scaled variable (for each feature f):

1. Define ranks for the feature values: a feature with M_f states maps to
   ranks r_if ∈ {1, ..., M_f}

2. Replace x_if (the value of the f-th variable of the i-th object) by its
   rank r_if

3. Map the range of feature f onto [0, 1] by replacing the rank with

   z_if = (r_if - 1) / (M_f - 1)

Then compute the dissimilarity using the methods for interval-scaled variables.
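A minimal sketch of the rank mapping (the job-rank state names are hypothetical):

```python
def ordinal_to_interval(value, states):
    """Map an ordinal value to [0, 1] via z_if = (r_if - 1) / (M_f - 1)."""
    r_if = states.index(value) + 1   # rank in {1, ..., M_f}
    M_f = len(states)
    return (r_if - 1) / (M_f - 1)

job_ranks = ['assistant', 'associate', 'full']   # hypothetical ordered states
print(ordinal_to_interval('assistant', job_ranks))  # 0.0
print(ordinal_to_interval('associate', job_ranks))  # 0.5
print(ordinal_to_interval('full', job_ranks))       # 1.0
```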
Dissimilarity between Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately exponential, such as A e^(Bt) or A e^(-Bt)
Alternative methods:
1. treat them like interval-scaled variables — not a good choice! (why? Since the scale may be distorted)
2. apply a logarithmic transformation y_if = log(x_if), and treat the transformed value as interval-scaled
3. treat them as continuous ordinal data and treat their ranks as interval-scaled.
Variables of Mixed Types
A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval-scaled, and ratio-scaled. One may use a weighted formula to combine their effects (distance between objects i and j):

   d(i,j) = ( Σ_{f=1}^{p} w_ij^(f) d_ij^(f) ) / ( Σ_{f=1}^{p} w_ij^(f) )

If feature f is binary or nominal:
   d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise

If feature f is interval-based: use the normalized distance

If feature f is ordinal or ratio-scaled: compute the ranks r_if, map them to

   z_if = (r_if - 1) / (M_f - 1)

and treat z_if as interval-scaled
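A rough sketch of the combined measure, under the simplifying assumptions noted in the comments (unit weights, features already scaled to [0, 1]; the sample objects are invented):

```python
def mixed_dissimilarity(i, j, types):
    """Sketch of the weighted formula with unit weights w_ij^(f) = 1,
    except w = 0 for asymmetric-binary negative matches; assumes interval
    and ordinal features are already scaled to [0, 1]."""
    num, den = 0.0, 0.0
    for x, y, t in zip(i, j, types):
        if t == 'asymmetric' and x == 0 and y == 0:
            continue                       # w = 0: negative match is ignored
        if t in ('binary', 'asymmetric', 'nominal'):
            d = 0.0 if x == y else 1.0
        else:                              # 'interval' or 'ordinal' (scaled)
            d = abs(x - y)
        num += d                           # w * d with w = 1
        den += 1.0
    return num / den

# Hypothetical objects: (asymmetric binary, nominal, scaled interval)
i = [1, 'red', 0.2]
j = [1, 'blue', 0.6]
print(mixed_dissimilarity(i, j, ['asymmetric', 'nominal', 'interval']))  # ~0.47
```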
Chapter 7. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
Major Clustering Approaches
Partitioning algorithms: Construct a number of partitions and
then improve them iteratively by some optimization criterion
Hierarchy algorithms: Create a hierarchical structure for the
data set using some optimization criterion
Density-based: Based on connectivity and/or density functions
Grid-based: Quantize the data space into a finite number of grid
cells on which cluster analysis is performed
Model-based: a cluster model is hypothesized for each cluster, and the best fit of the data to the given cluster models is found
Chapter 7. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
Partitioning Algorithms: Basic Concept
(Cluster Representation)

Partitioning method: given a database D of n objects, a partitioning algorithm organizes the objects into k partitions, each representing a cluster.

Given k, find the k clusters that optimize the chosen partitioning criterion (based on a similarity function):
   Global optimal solution: exhaustively enumerate all partitions
   Heuristic methods: the k-means and k-medoids algorithms
   k-means (MacQueen'67): each cluster is represented by the center of the cluster during the iterative clustering process
   k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by a representative object of the cluster during the iterative clustering process
Centroid-Based Clustering Technique
(The k-Means Clustering Method )
Given k, the k-Means algorithm is implemented in 4 steps:
1. Randomly select k of the data objects as the seed points. Each seed point represents a cluster.
2. Re-assign each data object to a cluster with the nearest seed point. (n objects, k clusters)
3. Compute new seed points which are the centroids of the clusters at this epoch. The centroid is the center (mean) of a cluster.
4. Go back to Step 2; stop when no reassignment occurs.
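A compact sketch of these four steps in Python (it assumes no cluster ever becomes empty; the two Gaussian blobs are synthetic):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """A minimal k-means sketch following the four steps above.
    Assumes no cluster ever becomes empty."""
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), k, replace=False)]    # step 1: random seeds
    for _ in range(n_iter):
        # step 2: assign each object to the cluster of the nearest seed
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: new seeds are the centroids (means) of the clusters
        new_seeds = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_seeds, seeds):              # step 4: stop if stable
            break
        seeds = new_seeds
    return labels, seeds

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])
labels, centroids = k_means(X, k=2)
```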
The k-Means Clustering Method
Example: 2-means

[Figure: four scatter plots on a 10 × 10 grid showing the 2-means iteration cycle: pick seeds and assign → find the centroids → reassign all data → find the centroids → stop if stabilized]
Comments on the k-Means Method
Strength:
   Relatively efficient: O(nkt), where n is # objects, k is # clusters, and t is # iterations; normally k, t << n
   Often terminates at a local optimum; the global optimum may be found using techniques such as simulated annealing and genetic algorithms

Weakness:
   Applicable only when the mean is defined; what about categorical data, like the colors red, blue, green, ...?
   Need to specify k, the number of clusters, in advance
   Sensitive to noisy data and outliers, which may distort the distribution of the data
   Not suitable for discovering clusters with non-convex shapes
Variations of the k-Means Method
A few variants of k-means differ in:
   Selection of the k initial seed points
   Dissimilarity calculations
   Strategies to calculate cluster means

Handling categorical data: k-modes (Huang'98)
   Uses new dissimilarity measures to deal with categorical objects
   Replaces the means of clusters with modes
   Uses a frequency-based method to update the modes of clusters
   A mixture of categorical and numerical data: the k-prototype method
Representative Object-Based Technique
(The K-Medoids Clustering Method)
Find representative objects, called medoids, in clusters.
   Medoid: the most centrally located data object of a cluster
   (k-means is centroid-based and is sensitive to noisy data and outliers)

PAM (Partitioning Around Medoids, 1987)
   PAM works effectively for small data sets, but does not scale well to large data sets: O(C(n,k))

Improved algorithms:
   CLARA (Kaufmann & Rousseeuw, 1990): use sampling to find representatives of the data (in order to reduce the data size)
   CLARANS (Ng & Han, 1994): use different samples for the representative search at each run; return the best representative set found in a user-defined number of runs
PAM (Partitioning Around Medoids) (1987)
PAM (Kaufman and Rousseeuw, 1987), built into S+. It uses representative objects of a cluster as seed points:

1. Select k representative objects (medoids) arbitrarily.

2. Assign each non-medoid object to the most similar representative object (n objects, k clusters).

3. Do a greedy search, O(k(n-k)^2): for each pair of non-medoid object h and medoid object i, calculate the total swapping cost TC_ih (after cluster reassignment), and find the best cost TC_ihBest.

   If TC_ihBest = (D_TBest - D_T) < 0, where

      D_T = Σ_{j=1}^{k} Σ_{p ∈ C_j} |p - o_j|

   (o_j is the medoid of cluster C_j), then medoid object i is replaced by h; otherwise stop.

4. Repeat step 3 until there is no change.
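A minimal sketch of this swap search over a precomputed distance matrix; it accepts any improving swap rather than only the single best TC_ih per pass, which reaches the same kind of local optimum:

```python
import numpy as np

def total_cost(D, medoids):
    """D_T: the sum over all objects of the distance to their nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, seed=0):
    """A sketch of PAM over a precomputed n x n distance matrix D."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(D), k, replace=False))   # step 1
    best = total_cost(D, medoids)
    improved = True
    while improved:                                        # step 4: repeat
        improved = False
        for mi in range(k):                                # step 3: try swaps
            for h in range(len(D)):
                if h in medoids:
                    continue
                trial = medoids[:mi] + [h] + medoids[mi + 1:]
                cost = total_cost(D, trial)
                if cost - best < 0:                        # TC_ih < 0: accept
                    best, medoids, improved = cost, trial, True
    labels = D[:, medoids].argmin(axis=1)                  # step 2: assignment
    return medoids, labels

# Usage idea: D = squareform(pdist(X)) from scipy.spatial.distance, then
# medoids, labels = pam(D, k=2)
```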
CLARA (Clustering Large Applications) (1990)
CLARA (Kaufmann and Rousseeuw, 1990)

Built into statistical analysis packages, such as S+. It draws sample sets of the data, applies PAM to the samples, and gives the best clustering as the output.
Strength: deals with larger data sets than PAM

Weaknesses:
   Efficiency depends on the sample size
   If the samples are biased, a good clustering based on the samples will not necessarily represent a good clustering of the whole data set
CLARANS (“Randomized” CLARA) (1994)
(Clustering Large Applications based on RANdomized Search, Ng & Han'94)

The PAM clustering process can be viewed as searching a graph where every node (a set of k medoids) is a potential solution: O(C(n,k)) nodes. At each step, all neighbors of the current node are examined, and the current node is replaced by the neighbor with the best improvement. (Two nodes are neighbors if their medoid sets differ by only one object; every node has k(n-k) neighbors.)

CLARA confines the search to a fixed sample of size m at each search stage. CLARANS instead dynamically draws a random sample of neighbors in each search step. If a local optimum is found, CLARANS starts with a new sample of neighbors in search of a new local optimum. Once a user-specified number of local minima has been found, CLARANS outputs the best one.

It is more efficient and scalable than both PAM and CLARA.
Chapter 7. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
Hierarchical Clustering
Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but may need a termination condition.
[Figure: five objects a, b, c, d, e merged step by step into {a,b}, {d,e}, {c,d,e}, {a,b,c,d,e}; read left to right (steps 0-4) for agglomerative clustering (AGNES), and right to left (steps 4-0) for divisive clustering (DIANA)]
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990); implemented in statistical analysis packages, e.g., S+.

Uses the single-link method and the dissimilarity matrix:
   Merge the nodes (clusters) that have the least dissimilarity
   Go on in a non-descending fashion
   Eventually all nodes belong to the same cluster

[Figure: three scatter plots showing clusters being merged step by step]

AGNES organizes the data objects into a tree of hierarchical partitions (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
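Single-link agglomerative clustering and dendrogram cutting are available in SciPy; a sketch on invented points:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Invented 2-D points; method='single' gives single-link (AGNES-style) merging
X = np.array([[1.0, 1.0], [1.5, 1.0], [2.0, 1.2],
              [7.0, 7.0], [7.5, 7.2], [8.0, 7.0]])
Z = linkage(X, method='single')    # each row of Z records one merge step

# Cut the dendrogram at the level where two clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)                      # e.g. [1 1 1 2 2 2]

# dendrogram(Z) draws the merge tree (requires matplotlib)
```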
Dendrogram (Shows How Clusters Merge Hierarchically)
Clustering Dendrogram in Gene Expression Analysis
[Figure: a clustering dendrogram used for finding differentially regulated genes]
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990); implemented in statistical analysis packages, e.g., S+.

Inverse order of AGNES: start with all objects in one cluster and split; eventually each node (data object) forms a cluster on its own.

[Figure: three scatter plots showing one cluster being split step by step]
More on Hierarchical Clustering Methods

Weaknesses of agglomerative clustering methods:
   They do not scale well: time complexity of at least O(n^2), where n is the number of data objects
   They can never undo what was done at a previous step

Integration of hierarchical and distance-based clustering:
   BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters (good for big data analysis)
   CURE (1998): selects well-scattered points from each cluster and then shrinks them towards the center of the cluster by a specified fraction
   CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (1996)
Birch: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
Incrementally constructs a CF-tree (Clustering Feature tree), a hierarchical data structure for multiphase clustering:
   Phase 1: scan the DB to build an initial in-memory CF-tree (a hierarchical compression of the data that tries to preserve its inherent clustering structure)
   Phase 2: use an arbitrary (hierarchical) clustering algorithm to cluster the data in each leaf node of the CF-tree
Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans (a good feature for big data analysis).

Weakness: handles only numeric data, and is sensitive to the input order of the data records.
BIRCH's Clustering Feature Vector

Clustering feature of a sub-cluster: CF = (N, LS, SS)

   N  : number of data vectors in the sub-cluster
   LS : Σ_{i=1}^{N} X_i   (linear sum)
   SS : Σ_{i=1}^{N} X_i^2 (square sum)

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):

   CF = (5, (16,30), (54,190))
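Reproducing this CF vector, plus the additivity property that lets BIRCH merge sub-clusters in constant time; a sketch:

```python
import numpy as np

points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)], dtype=float)

N = len(points)                   # zeroth moment
LS = points.sum(axis=0)           # first moment: linear sum per dimension
SS = (points ** 2).sum(axis=0)    # second moment: square sum per dimension
print(N, LS, SS)                  # 5 [16. 30.] [ 54. 190.]

# CF vectors are additive, so merging two sub-clusters is O(1):
CF1 = (3, points[:3].sum(axis=0), (points[:3] ** 2).sum(axis=0))
CF2 = (2, points[3:].sum(axis=0), (points[3:] ** 2).sum(axis=0))
merged = (CF1[0] + CF2[0], CF1[1] + CF2[1], CF1[2] + CF2[2])  # equals (N, LS, SS)
```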
BIRCH’s Parameters & Their Meanings
A clustering feature holds the summary statistics for the data in a sub-cluster:
   The zeroth moment: the number of data vectors in the sub-cluster
   The first moment: the linear sum of the N data vectors
   The second moment: the square sum of the N data vectors

BIRCH uses two parameters to control node splits while constructing a CF-tree:
   Branching factor: constrains the maximum number of child nodes per non-leaf node (a leaf node is required to fit in a memory page of size P, which determines the branching factor of a leaf node)
   Threshold: constrains the maximum average distance between data pairs of a sub-cluster in the leaf nodes
CF Tree (branching factor = 7, threshold = 6)

[Figure: a CF-tree whose root holds entries CF1 ... CF6, each pointing to a child; non-leaf nodes hold entries CF1 ... CF5 pointing to their children; leaf nodes hold CF entries and are chained together by prev/next pointers]
An object is inserted into the closest leaf entry; the node is split as needed. After an insertion, information about it is propagated toward the root of the tree. The construction is similar to that of a B+-tree.
CURE (Clustering Using REpresentatives )
CURE: proposed by Guha, Rastogi & Shim, 1998; a type of hierarchical clustering.

Most clustering algorithms either favor clusters with spherical shapes and similar sizes, or are fragile in the presence of outliers. CURE overcomes these problems.

Uses multiple representative points to represent a cluster. At each step of clustering, the two clusters with the closest pair of representative points are merged.

Adjusts well to arbitrarily shaped clusters: the representative points of a cluster attempt to capture the data distribution and shape of the cluster.

Stops building the cluster hierarchy when only k clusters remain.
The Steps for a Large Data Set in CURE

[Figure: overview of the steps involved in clustering a large data set using CURE]
Cure: The Algorithm(for large data set in CURE )
1. Phase 1: pre-clustering

   a. Draw a random sample S of the original data
      (if the original data set is very large)

   b. Partition sample S into p partitions
      (partitioning for speedup)

   c. Partially cluster each partition into |S|/(p × q) clusters, for some
      q > 1 (by CURE hierarchical clustering, starting with each input
      point as a separate cluster)

   d. Outlier elimination:
      Most outliers are eliminated by the random sampling (step a)
      If a cluster grows too slowly, eliminate it
Cure: The Algorithm
2. Phase 2: find the final clusters (by CURE)

   a. Cluster the partial clusters (# of partial clusters <= |S|/q):
      Find well-scattered representative points for the new cluster at each
      iteration. The representative points of a cluster are used to compute
      its distance from other clusters.
      The distance between two clusters is the distance between their
      closest pair of representative points. Merge the closest pair of
      clusters.
      The representative points falling in each newly formed cluster are
      "shrunk", i.e., moved toward the cluster center by a user-specified
      fraction α.
      The representative points capture the shape of the cluster.

   b. Mark the data with the corresponding cluster labels
Data Partitioning and Clustering

[Figure: a sample of s = 50 points is split into p = 2 partitions of size s/p = 25 each, then each partition is partially clustered into s/(p q) = 5 clusters]
Representative Points Shrinking in Cure
Shrink the multiple representative points towards the gravity center of the cluster by a fraction α.

Multiple representatives capture the shape of the cluster.

[Figure: representative points before and after shrinking toward the cluster center; the old and new cluster centers are marked]
CHAMELEON
CHAMELEON: a type of hierarchical clustering, by G. Karypis, E. H. Han and V. Kumar'99.

Measures the similarity based on a dynamic model: two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters.

A two-phase algorithm:
   1. Use a graph-partitioning algorithm to cluster the objects into a large number of relatively small sub-clusters.
   2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters.
Overall Framework of CHAMELEON
Data set → construct a sparse graph (a k-nearest-neighbor graph) → partition the graph by cutting long edges → merge the partitions according to relative interconnectivity and relative closeness → final clusters
Chapter 7. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
Density-Based Clustering Methods
Clustering based on density (a local cluster criterion, such as density-connected points).

Major features:
   Discover clusters of arbitrary shape
   Handle noise
   One scan
   Need density parameters as a termination condition

Several interesting studies:
   DBSCAN: Ester, et al. (KDD'96)
   OPTICS: Ankerst, et al. (SIGMOD'99)
   DENCLUE: Hinneburg & Keim (KDD'98)
Density-Based Clustering: Background (I)
Two density parameters:
   ε : maximum neighborhood radius of a core point
   MinPts : minimum number of points in an ε-neighborhood of a core point

N_ε(p) = {q ∈ D | dist(p,q) <= ε}
   (D : the data space, N_ε(p) : the ε-neighborhood of point p)

Directly density-reachable: a point p is directly density-reachable from a point q wrt. ε, MinPts if
   1) p belongs to N_ε(q)
   2) |N_ε(q)| >= MinPts (q is a core point)

[Figure: q is a core point with p inside its ε-neighborhood; MinPts = 5, ε = 1 cm]
Density-Based Clustering: Background (II)
Density-reachable: a point p is density-reachable from a point q wrt. ε, MinPts if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i.

Density-connected: a point p is density-connected to a point q wrt. ε, MinPts if there is a point o such that both p and q are density-reachable from o wrt. ε and MinPts.

[Figure: a chain q → p_1 → p of direct reachability, and points p, q both density-reachable from o]
DBSCAN: Density Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.

If the ε-neighborhood of an object contains at least MinPts objects, the object is called a core object.

[Figure: core points, border points, and outliers for ε = 1 cm, MinPts = 5]
DBSCAN: The Algorithm
Arbitrarily select a point p.

Retrieve all points density-reachable from p wrt. ε and MinPts.

If p is a core point, a cluster is formed.

If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.

Continue the process until all of the points have been processed.
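This algorithm is implemented in scikit-learn; a sketch on synthetic data (eps corresponds to ε and min_samples to MinPts):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point (synthetic data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),
               rng.normal(5, 0.3, (30, 2)),
               [[10.0, 10.0]]])

# eps plays the role of the radius epsilon, min_samples of MinPts
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(set(labels))   # {0, 1, -1}: two clusters; -1 marks noise/outliers
```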
Chapter 7. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
Grid-Based Clustering Method
Uses a multi-resolution grid data structure.

Several interesting methods:
   STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
   WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98): a multi-resolution clustering approach using the wavelet method
   CLIQUE: Agrawal, et al. (SIGMOD'98)
STING: A Statistical Information Grid Approach (1)
Wang, Yang and Muntz (VLDB'97)

The spatial area is divided into rectangular cells. There are several levels of cells corresponding to different levels of resolution (1 → 4 → 16 → 64 → ...); each cell at a high level is partitioned into a number of smaller cells at the next lower level.

Statistical information about each cell is calculated and stored beforehand and is used to answer queries. Parameters of higher-level cells can easily be calculated from the parameters of lower-level cells:
   count, mean, s, min, max
   type of distribution (normal, uniform, etc.)

Use a top-down approach to answer spatial data queries:
   Start from a pre-selected layer, typically one with a small number of cells
   For each cell in the current level, compute the confidence interval
STING: A Statistical Information Grid Approach (2)
Remove the irrelevant cells from further consideration. When the current layer is finished, proceed to the next lower level. Repeat this process until the bottom layer is reached.

Advantages:
   Query-independent, easy to parallelize, incremental update
   O(K), where K is the number of grid cells at the lowest level

Disadvantages:
   All cluster boundaries are either horizontal or vertical; no diagonal boundary is detected
STING: A Statistical Information Grid Approach (3)

[Figure: the multi-level hierarchy of grid cells in STING]
CLIQUE (Clustering In QUEst) (clustering for high-dimensional spaces)

Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)

Automatically identifies the subspaces of a high-dimensional data space that allow better clustering than the original space.

CLIQUE can be considered both density-based and grid-based:
   It partitions each dimension into the same number of equal-length units
   A unit is dense if the fraction of the total data points contained in the unit exceeds the input model parameter τ
   A cluster is a maximal set of connected dense units within a subspace of higher dimensionality
CLIQUE: The Major Steps
Identify dense units of higher dimensionality:
   Partition the data space into units and find the number of points that lie inside each unit.
   Determine the dense units in all 1-D dimensions of interest.
   Identify dense units of higher dimensionality using the Apriori principle (if a k-D unit is dense, then its (k-1)-D projection units must be dense), as in the sketch below.

Generate a minimal description for each cluster:
   Determine the maximal region that covers a set of connected dense units for the cluster.
   Determine the minimal cover for each cluster.
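A sketch of the 1-D dense-unit step (the grid size, τ, and data are all invented; the full algorithm joins dense units dimension by dimension, Apriori-style):

```python
import numpy as np

def dense_units_1d(values, n_units, tau):
    """Partition one dimension into n_units equal-length units and return
    the indices of the dense units (fraction of points > tau)."""
    lo, hi = values.min(), values.max()
    idx = np.minimum(((values - lo) / (hi - lo) * n_units).astype(int),
                     n_units - 1)                     # unit index per point
    counts = np.bincount(idx, minlength=n_units)
    return set(np.nonzero(counts / len(values) > tau)[0])

rng = np.random.default_rng(0)
age = np.concatenate([rng.uniform(30, 50, 80), rng.uniform(20, 70, 20)])
print(dense_units_1d(age, n_units=10, tau=0.05))
# Candidate k-D dense units are then built only from dense (k-1)-D units
# (Apriori pruning) and verified by counting points, as with frequent itemsets.
```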
[Figure: 1-D dense units on the age axis (roughly ages 30-50) and on the salary axis (in units of 10,000) are intersected to find dense 2-D units in the (age, salary) and (age, vacation) subspaces, with density threshold τ = 3; the maximal region covering the connected dense units describes the cluster]
Strength and Weakness of CLIQUE
Strength:
   It automatically finds the subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
   It is insensitive to the order of the input records and does not presume any canonical data distribution
   It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases

Weakness:
   The accuracy of the clustering result may be degraded at the expense of the simplicity of the method
Chapter 7. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
Model-Based Clustering Methods
Attempt to optimize the fit between the data and some given cluster models.

Conceptual clustering (a statistical and AI approach):
   Clustering is performed first, followed by characterization
   Produces a classification scheme for a set of unlabeled objects
   Finds a conceptual characteristic description for each class/cluster

COBWEB (Fisher'87):
   A popular and simple method of incremental conceptual clustering
   Creates a hierarchical clustering in the form of a classification tree
   Each node denotes a concept and contains a probabilistic description of that concept
COBWEB Clustering Method
A classification tree:

   animal: P(C0) = 1.0, P(scales|C0) = 0.25, ...
   ├─ fish: P(C1) = 0.25, P(scales|C1) = 1.0, ...
   ├─ amphibian: P(C2) = 0.25, P(moist|C2) = 1.0, ...
   └─ mammal/bird: P(C3) = 0.5, P(hair|C3) = 0.5, ...
      ├─ mammal: P(C4) = 0.25, P(hair|C4) = 1.0, ...
      └─ bird: P(C5) = 0.25, P(feathers|C5) = 1.0, ...
Other Model-Based Clustering Methods
Neural network approaches:
   Represent each cluster as an exemplar, acting as a "prototype" of the cluster
   In the clustering process, an input data object is assigned to the cluster whose exemplar is most similar to it according to some distance measure

They are based on competitive learning:
   They involve an organized architecture of neuron units
   The neuron units compete in a "winner-takes-all" fashion for the current input training object
Self-Organizing Feature Maps (SOMs)
(A Neural Network Approach)
Clustering is performed by having the output neurons compete for the current input data object: the neuron whose connection weight vector is closest to the current training object wins, and the winner and its neighbors learn by having their connection weights adjusted.

SOMs are believed to resemble processing that can occur in the human brain.

They are also useful for visualizing high-dimensional data mapped into 2- or 3-D space.
An Example of SOM Architecture
An example SOM network:
   2 input neurons
   16 output neurons (clusters), arranged in a 4 × 4 architecture
   Connection weight matrix: 16 × 2
   Weight vector of neuron i: <w_i1, w_i2>
   Format of input X: <x_1, x_2>

[Figure: 2 input neurons fully connected to a 4 × 4 grid of output neurons]
SOM Output Map Example
Examples of SOM Feature Map
Mathematical Model of an SOM Example (a 3-input, 25-output SOM)

The 25 output neurons are arranged in a 5 × 5 feature map (neurons 1-25). The inputs x_1, x_2, x_3 feed through the connection weights into the competitive layer, producing the feature-map outputs a_1, ..., a_25:

   a_{25×1} = compet( W_{25×3} x_{3×1} )
Weights Update & Neighborhood in SOM
Update the weight vectors in a neighborhood of the winning neuron i*:

   W_i(p+1) = W_i(p) + α (X(p) - W_i(p)),   for i ∈ N_{i*}(d)

   N_{i*}(d) = { j | d(i*, j) <= d }

where X(p) is the training vector at iteration p.

Example (5 × 5 map, neuron 13 the winner):

   N_13(1) = {8, 12, 13, 14, 18}
   N_13(2) = {3, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 23}
SOM Learning Algorithm

Step 1: Initialisation.

Set the initial connection weights to small random values, say in an interval [0, 1], and assign a small positive value to the learning rate parameter α.
Step 2: Find the winning neuron (iteration p) (activation and similarity matching).

Activate the Kohonen network by applying an input vector X = (x_1, x_2, ..., x_n), and find the winning (best-matching) neuron i* using the criterion of minimal Euclidean distance:

   i*(p) = min_i || X - W_i(p) ||
         = min_i [ Σ_{j=1}^{n} (x_j - w_ij(p))^2 ]^(1/2),   i = 1, ..., m

where n is the number of input neurons and m is the number of neurons in the Kohonen layer.
Step 3: Learning.

Update the connection weight between input neuron j and output neuron i (1 <= i <= m, 1 <= j <= n):

   w_ij(p+1) = w_ij(p) + Δw_ij(p)

where Δw_ij(p) is the weight correction at iteration p, determined by the competitive learning rule:

   Δw_ij(p) = α (x_j - w_ij(p))   if i ∈ N_{i*}(d, p)
   Δw_ij(p) = 0                   otherwise

Here α is the learning rate parameter, and N_{i*}(d, p) is the neighbourhood function centred on the winning neuron i* at a distance d at iteration p.
Step 4: Iteration.

Increase iteration p by one, go back to Step 2, and continue until the criterion of minimal Euclidean distance is satisfied or no noticeable changes occur in the feature map.
Learning in the Kohonen Network

The overall effect of the SOM learning rule is to move the connection weight vector W_i of the trained output neuron i towards the input pattern X:

   ΔW_i = α (X - W_i)

   W_i(p+1) = W_i(p) + ΔW_i(p) = W_i(p) + α (X(p) - W_i(p))

[Figure: the weight vector W_i moves a fraction α along the difference X - W_i towards the input pattern]
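Putting steps 1-4 together, a compact SOM training loop for the 5 × 5 map above, with a Manhattan-distance neighborhood matching N_13(1) and N_13(2); the training data and learning rate are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 25, 3                            # 5 x 5 output map, 3 inputs
W = rng.uniform(0, 1, (m, n))           # step 1: small random weights
alpha = 0.1                             # learning rate (made-up value)
grid = np.array([(i // 5, i % 5) for i in range(m)])   # map coordinates

def neighborhood(winner, d):
    """N_i*(d): neurons within Manhattan distance d of the winner on the
    map (matches N_13(1) and N_13(2) above)."""
    return np.abs(grid - grid[winner]).sum(axis=1) <= d

for p in range(1000):                   # step 4: iterate
    x = rng.uniform(0, 1, n)            # a training vector (synthetic)
    winner = np.argmin(np.linalg.norm(x - W, axis=1))    # step 2
    mask = neighborhood(winner, d=1)
    W[mask] += alpha * (x - W[mask])    # step 3: W_i += alpha (X - W_i)
```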
Convergence Process of SOM
[Figure: four snapshots of the SOM feature map unfolding over the input space [-1, 1] × [-1, 1], after 100, 1000, 2000, and 5000 iterations]
What Is Outlier Discovery?
What are outliers?
   A set of objects considerably dissimilar from the remainder of the data
   Example: in sports, Michael Jordan, ...

Problem: find the top n outlier points

Applications:
   Credit card fraud detection
   Telecom fraud detection
   Customer segmentation
   Medical analysis
Outlier Discovery: Statistical Approaches
Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution).

Use discordancy tests, depending on:
   the data distribution
   the distribution parameters (e.g., mean, variance)
   the number of expected outliers

Drawbacks:
   Most tests are for a single attribute
   In many cases, the data distribution may not be known
Outlier Discovery: Distance-Based Approach
Introduced to counter the main limitations imposed by statistical methods: we need multi-dimensional analysis without knowing the data distribution.

Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O.

Algorithms for mining distance-based outliers:
   Index-based algorithm
   Nested-loop algorithm (see the sketch below)
   Cell-based algorithm
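A sketch of the nested-loop variant (synthetic data; for simplicity the fraction is computed over the other objects in T):

```python
import numpy as np

def db_outliers(X, p, D):
    """Nested-loop sketch of DB(p, D)-outliers: O is reported if at least a
    fraction p of the other objects lie at a distance greater than D from O."""
    outliers = []
    for i, o in enumerate(X):
        far = sum(np.linalg.norm(o - x) > D
                  for j, x in enumerate(X) if j != i)
        if far / (len(X) - 1) >= p:
            outliers.append(i)
    return outliers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), [[12.0, 12.0]]])
print(db_outliers(X, p=0.95, D=5.0))   # [50]: the isolated point
```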
Outlier Discovery: Deviation-Based Approach
Identifies outliers by examining the main characteristics of the objects in a group: objects that "deviate" from this description are considered outliers.

Sequential exception technique:
   simulates the way in which humans distinguish unusual objects from a series of similar objects

OLAP data cube technique:
   uses data cubes to identify regions of anomalies in a large multi-dimensional data set
Summary
Cluster analysis groups data objects based on their similarity and has wide applications
Measure of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
Outlier detection and analysis are very useful for fraud detection and can be performed by statistical, distance-based or deviation-based approaches
There are still many open research issues in cluster analysis, such as constraint-based clustering