M.Vijayalakshmi VESIT BE(IT) Data Mining 2
Clustering Outline

Goal: Provide an overview of the clustering problem and introduce some of the basic algorithms

Clustering Problem Overview
Clustering Techniques
– Hierarchical Algorithms
– Partitional Algorithms
– Genetic Algorithm
– Clustering Large Databases
General Applications of Clustering

Pattern Recognition
Spatial Data Analysis
– create thematic maps in GIS by clustering feature spaces
– detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (especially market research)
WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns
Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs
Land use: Identification of areas of similar land use in an earth observation database
Insurance: Identifying groups of motor insurance policy holders with a high average claim cost
City-planning: Identifying groups of houses according to their house type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be clustered along continent faults
Clustering vs. Classification

No prior knowledge
– Number of clusters
– Meaning of clusters
– Cluster results are dynamic
Unsupervised learning
Classification vs. Clustering

Classification: Supervised learning:
Learns a method for predicting the instance class from pre-labeled (classified) instances
Clustering

Unsupervised learning:
Finds "natural" grouping of instances given un-labeled data
Clustering Houses
Size Based
Geographic Distance Based
Clustering Methods

Many different methods and algorithms:
– For numeric and/or symbolic data
– Deterministic vs. probabilistic
– Exclusive vs. overlapping
– Hierarchical vs. flat
– Top-down vs. bottom-up
Clustering Issues

Outlier handling
Dynamic data
Interpreting results
Evaluating results
Number of clusters
Data to be used
Scalability
Impact of Outliers on Clustering
Clustering Evaluation

Manual inspection
Benchmarking on existing labels
Cluster quality measures
– distance measures
– high similarity within a cluster, low across clusters
Data Structures

Data matrix (n objects × p variables):

  [ x_11  ...  x_1f  ...  x_1p ]
  [ ...   ...  ...   ...  ...  ]
  [ x_i1  ...  x_if  ...  x_ip ]
  [ ...   ...  ...   ...  ...  ]
  [ x_n1  ...  x_nf  ...  x_np ]

Dissimilarity matrix (n × n, lower triangular):

  [ 0                            ]
  [ d(2,1)   0                   ]
  [ d(3,1)   d(3,2)   0          ]
  [ :        :        :          ]
  [ d(n,1)   d(n,2)   ...   0    ]
Measure the Quality of Clustering

Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
There is a separate "quality" function that measures the "goodness" of a cluster.
The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.
Weights should be associated with different variables based on applications and data semantics.
It is hard to define "similar enough" or "good enough" – the answer is typically highly subjective.
Type of Data in Clustering Analysis

Interval-scaled variables
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects.
Some popular ones include the Minkowski distance:

  d(i, j) = (|x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q)^(1/q)

where i = (x_i1, x_i2, ..., x_ip) and j = (x_j1, x_j2, ..., x_jp) are two p-dimensional data objects, and q is a positive integer.

If q = 1, d is the Manhattan distance:

  d(i, j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
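The definition above translates directly into code. This is an illustrative sketch, not part of the slides; q = 1 gives the Manhattan distance and q = 2 the Euclidean distance.

```python
# Minkowski distance between two p-dimensional points (illustrative sketch).
def minkowski(i, j, q):
    return sum(abs(a - b) ** q for a, b in zip(i, j)) ** (1 / q)

print(minkowski((1, 2), (4, 6), 1))  # Manhattan: |1-4| + |2-6| = 7.0
print(minkowski((1, 2), (4, 6), 2))  # Euclidean: sqrt(9 + 16) = 5.0
```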
Similarity and Dissimilarity Between Objects (Cont.)

If q = 2, d is the Euclidean distance:

  d(i, j) = sqrt(|x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2)

– Properties
  d(i,j) >= 0
  d(i,i) = 0
  d(i,j) = d(j,i)
  d(i,j) <= d(i,k) + d(k,j)
Binary Variables

A contingency table for binary data:

                 Object j
                  1     0     sum
  Object i   1    a     b     a+b
             0    c     d     c+d
           sum   a+c   b+d     p

Simple matching coefficient (invariant, if the binary variable is symmetric):

  d(i, j) = (b + c) / (a + b + c + d)

Jaccard coefficient (noninvariant if the binary variable is asymmetric):

  d(i, j) = (b + c) / (a + b + c)
Dissimilarity between Binary Variables

Example
– gender is a symmetric attribute
– the remaining attributes are asymmetric binary
– let the values Y and P be set to 1, and the value N be set to 0

  Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
  Jack    M       Y      N      P       N       N       N
  Mary    F       Y      N      P       N       P       N
  Jim     M       Y      P      N       N       N       N

  d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
  d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
  d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
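The example above can be checked with a short sketch that counts the contingency cells a, b, c, d and applies either coefficient (the symmetric gender attribute is excluded; attribute order is Fever, Cough, Test-1, Test-2, Test-3, Test-4, with Y/P mapped to 1 and N to 0):

```python
# Simple matching and Jaccard dissimilarity over 0/1 vectors (sketch).
def binary_dissim(x, y, symmetric=False):
    a = sum(u and v for u, v in zip(x, y))          # 1-1 matches
    b = sum(u and not v for u, v in zip(x, y))      # 1-0 mismatches
    c = sum(not u and v for u, v in zip(x, y))      # 0-1 mismatches
    d = sum(not u and not v for u, v in zip(x, y))  # 0-0 matches
    if symmetric:
        return (b + c) / (a + b + c + d)   # simple matching coefficient
    return (b + c) / (a + b + c)           # Jaccard coefficient

jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)
print(round(binary_dissim(jack, mary), 2))  # 0.33
print(round(binary_dissim(jack, jim), 2))   # 0.67
print(round(binary_dissim(jim, mary), 2))   # 0.75
```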
Nominal Variables

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

Method 1: Simple matching
– m: # of matches, p: total # of variables

  d(i, j) = (p - m) / p

Method 2: use a large number of binary variables
– creating a new binary variable for each of the M nominal states
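Method 1 above is a one-liner in code; this sketch uses two illustrative nominal attributes (colour and size, not from the slides):

```python
# Simple matching dissimilarity for nominal tuples: d(i, j) = (p - m) / p.
def nominal_dissim(x, y):
    p = len(x)                              # total number of variables
    m = sum(u == v for u, v in zip(x, y))   # number of matching variables
    return (p - m) / p

print(nominal_dissim(("red", "small"), ("red", "large")))  # 0.5
```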
Clustering ProblemClustering Problem
Given a database D={tGiven a database D={t11,t,t22,…,t,…,tnn} of tuples } of tuples and an integer value k, the and an integer value k, the Clustering Clustering ProblemProblem is to define a mapping is to define a mapping f:Df:D{1,..,k}{1,..,k} where each where each ttii is assigned to one cluster is assigned to one cluster KKjj, , 1<=j<=k.1<=j<=k.
A A ClusterCluster, K, Kjj,, contains precisely those contains precisely those tuples mapped to it.tuples mapped to it.Unlike classification problem, clusters are Unlike classification problem, clusters are not known a priori.not known a priori.
Types of Clustering

Hierarchical – Nested set of clusters created.
Partitional – One set of clusters created.
Incremental – Each element handled one at a time.
Simultaneous – All elements handled together.
Overlapping/Non-overlapping
Major Clustering Approaches

Partitioning algorithms: Construct various partitions and then evaluate them by some criterion
Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion
Density-based: based on connectivity and density functions
Grid-based: based on a multiple-level granularity structure
Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of that model to the data
Clustering Approaches
Clustering
Hierarchical Partitional Categorical Large DB
Agglomerative Divisive Sampling Compression
Cluster Parameters
Distance Between Clusters

Single Link: smallest distance between points
Complete Link: largest distance between points
Average Link: average distance between points
Centroid: distance between centroids
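The four inter-cluster distances above can be sketched for 1-D points, using |a - b| as the point-to-point distance (an illustrative sketch, not from the slides):

```python
# Inter-cluster distances for 1-D clusters (lists of numbers).
def single_link(c1, c2):   return min(abs(a - b) for a in c1 for b in c2)
def complete_link(c1, c2): return max(abs(a - b) for a in c1 for b in c2)
def average_link(c1, c2):
    return sum(abs(a - b) for a in c1 for b in c2) / (len(c1) * len(c2))
def centroid_dist(c1, c2):
    return abs(sum(c1) / len(c1) - sum(c2) / len(c2))

c1, c2 = [1, 3], [6, 10]
print(single_link(c1, c2), complete_link(c1, c2))   # 3 9
print(average_link(c1, c2), centroid_dist(c1, c2))  # 6.0 6.0
```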
Hierarchical Clustering

Clusters are created in levels, actually creating sets of clusters at each level.
Agglomerative
– Initially each item in its own cluster
– Iteratively clusters are merged together
– Bottom Up
Divisive
– Initially all items in one cluster
– Large clusters are successively divided
– Top Down
Hierarchical Clustering

Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition.

[Figure: five objects a–e. Agglomerative (AGNES), steps 0–4: {a,b} merge, {d,e} merge, c joins {d,e}, then {a,b} and {c,d,e} merge into one cluster. Divisive (DIANA) performs the same splits in reverse, steps 4–0.]
Hierarchical Algorithms

Single Link
MST Single Link
Complete Link
Average Link
Dendrogram

A tree data structure which illustrates hierarchical clustering techniques.
Each level shows clusters for that level.
– Leaf – individual clusters
– Root – one cluster
A cluster at level i is the union of its children clusters at level i+1.
Levels of Clustering
Agglomerative Example

       A   B   C   D   E
  A    0   1   2   2   3
  B    1   0   2   4   3
  C    2   2   0   1   5
  D    2   4   1   0   3
  E    3   3   5   3   0

[Figure: dendrogram with threshold axis 1–5. At threshold 1: {A,B}, {C,D}, {E}; at threshold 2: {A,B,C,D}, {E}; at threshold 3 all five objects form one cluster.]
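Single-link merging at a threshold amounts to taking connected components over all edges no longer than the threshold. This sketch (names illustrative) reproduces the example above with a small union-find:

```python
# Single-link clusters at a given threshold: connect every pair of points
# whose distance is <= threshold, then take connected components.
def single_link_clusters(points, dist, threshold):
    parent = {p: p for p in points}
    def find(p):
        while parent[p] != p:
            p = parent[p]
        return p
    for i, p in enumerate(points):
        for q in points[i + 1:]:
            if dist[frozenset((p, q))] <= threshold:
                parent[find(p)] = find(q)   # union the two components
    groups = {}
    for p in points:
        groups.setdefault(find(p), set()).add(p)
    return list(groups.values())

# The distance matrix from the slide, stored as unordered pairs.
d = {frozenset(e): w for e, w in [
    ("AB", 1), ("AC", 2), ("AD", 2), ("AE", 3),
    ("BC", 2), ("BD", 4), ("BE", 3), ("CD", 1), ("CE", 5), ("DE", 3)]}
print(single_link_clusters("ABCDE", d, 1))  # clusters {A,B}, {C,D}, {E}
```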
MST Example

       A   B   C   D   E
  A    0   1   2   2   3
  B    1   0   2   4   3
  C    2   2   0   1   5
  D    2   4   1   0   3
  E    3   3   5   3   0

[Figure: a minimum spanning tree over A–E built from the matrix, e.g. edges A–B (1), C–D (1), A–C (2), A–E (3).]
Agglomerative Algorithm
Single Link

View all items with links (distances) between them.
Finds maximal connected components in this graph.
Two clusters are merged if there is at least one edge which connects them.
Uses threshold distances at each level.
Could be agglomerative or divisive.
MST Single Link Algorithm
Single Link Clustering
AGNES (Agglomerative Nesting)

Implemented in statistical analysis packages, e.g., Splus
Uses the Single-Link method and the dissimilarity matrix
Merge nodes that have the least dissimilarity
Go on in a non-descending fashion
Eventually all nodes belong to the same cluster
A Dendrogram Shows How the Clusters are Merged Hierarchically
Decompose data objects into several levels of nested partitioning (tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level, then each connected component forms a cluster.
DIANA (Divisive Analysis)

Implemented in statistical analysis packages, e.g., Splus
Inverse order of AGNES
Eventually each node forms a cluster on its own
Partitional Clustering

Nonhierarchical
Creates clusters in one step as opposed to several steps.
Since only one set of clusters is output, the user normally has to input the desired number of clusters, k.
Usually deals with static sets.
Partitioning Algorithms

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters
Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
  – k-means: Each cluster is represented by the center of the cluster
  – k-medoids or PAM (Partition around medoids): Each cluster is represented by one of the objects in the cluster
Partitional Algorithms

MST
Squared Error
K-Means
Nearest Neighbor
PAM
BEA
GA
MST Algorithm
K-Means

Initial set of clusters randomly chosen.
Iteratively, items are moved among sets of clusters until the desired set is reached.
High degree of similarity among elements in a cluster is obtained.
Given a cluster K_i = {t_i1, t_i2, ..., t_im}, the cluster mean is m_i = (1/m)(t_i1 + ... + t_im)
K-Means Example

Given: {2,4,10,12,3,20,30,11,25}, k=2
Randomly assign means: m1 = 3, m2 = 4
K1 = {2,3}, K2 = {4,10,12,20,30,11,25}, m1 = 2.5, m2 = 16
K1 = {2,3,4}, K2 = {10,12,20,30,11,25}, m1 = 3, m2 = 18
K1 = {2,3,4,10}, K2 = {12,20,30,11,25}, m1 = 4.75, m2 = 19.6
K1 = {2,3,4,10,11,12}, K2 = {20,30,25}, m1 = 7, m2 = 25
Stop, as the clusters with these means are the same.
The The K-MeansK-Means Clustering Clustering MethodMethod
Given k, the k-means algorithm is implemented in 4 steps:– Partition objects into k nonempty subsets– Compute seed points as the centroids of the
clusters of the current partition. The centroid is the center (mean point) of the cluster.
– Assign each object to the cluster with the nearest seed point.
– Go back to Step 2, stop when no more new assignment.
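The steps above can be sketched for 1-D data; with the seeds m1 = 3, m2 = 4 this minimal version reproduces the worked example from the earlier slide. It assumes, as in that example, that no cluster ever becomes empty.

```python
# Minimal 1-D k-means sketch: assign to nearest mean, recompute means,
# stop when the assignments (and hence the means) no longer change.
def kmeans(items, seeds):
    means = list(seeds)
    while True:
        clusters = [[] for _ in means]
        for x in items:   # step 3: assign each item to the nearest mean
            clusters[min(range(len(means)),
                         key=lambda i: abs(x - means[i]))].append(x)
        new_means = [sum(c) / len(c) for c in clusters]  # step 2: centroids
        if new_means == means:                           # step 4: converged
            return clusters
        means = new_means

print(kmeans([2, 4, 10, 12, 3, 20, 30, 11, 25], [3, 4]))
# [[2, 4, 10, 12, 3, 11], [20, 30, 25]]
```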
Comments on the K-Means Method

Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms
Weakness
– Applicable only when mean is defined; then what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes
Nearest Neighbor

Items are iteratively merged into the existing clusters that are closest.
Incremental
Threshold, t, used to determine if items are added to existing clusters or a new cluster is created.
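The scheme above can be sketched in a few lines: each item joins the closest existing cluster if its nearest member is within the threshold t, otherwise it starts a new cluster (function and parameter names are illustrative, not from the slides):

```python
# Incremental nearest-neighbour clustering with a distance threshold t.
def nearest_neighbour(items, dist, t):
    clusters = []
    for x in items:
        if clusters:
            # cluster whose closest member is nearest to x
            nearest = min(clusters, key=lambda c: min(dist(x, y) for y in c))
            if min(dist(x, y) for y in nearest) <= t:
                nearest.append(x)     # close enough: merge into that cluster
                continue
        clusters.append([x])          # otherwise start a new cluster
    return clusters

print(nearest_neighbour([1, 2, 10, 11, 20], lambda a, b: abs(a - b), 3))
# [[1, 2], [10, 11], [20]]
```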
Variations of the K-Means Method

A few variants of the k-means which differ in
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means
Handling categorical data: k-modes
– Replacing means of clusters with modes
– Using new dissimilarity measures to deal with categorical objects
– Using a frequency-based method to update modes of clusters
– A mixture of categorical and numerical data: k-prototype method
The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters
PAM (Partitioning Around Medoids)
– starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
– PAM works effectively for small data sets, but does not scale well for large data sets
CLARA: applies PAM to samples of the data set
CLARANS: randomized sampling; focusing + spatial data structure
PAM (Partitioning Around Medoids)

PAM – Use real object to represent the cluster
– Select k representative objects arbitrarily
– For each pair of non-selected object h and selected object i, calculate the total swapping cost TC_ih
– For each pair of i and h,
  If TC_ih < 0, i is replaced by h
  Then assign each non-selected object to the most similar representative object
– Repeat steps 2-3 until there is no change
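A compact sketch of the steps above (illustrative names; the swap test cost(trial) < cost(current) is equivalent to TC_ih < 0):

```python
# PAM sketch: swap a medoid i for a non-medoid h whenever the swap lowers
# the total distance of the clustering, until no swap helps.
def pam(points, dist, k):
    medoids = list(points[:k])      # step 1: arbitrary initial medoids
    def cost(meds):                 # total distance to the nearest medoid
        return sum(min(dist(p, m) for m in meds) for p in points)
    changed = True
    while changed:
        changed = False
        for i in range(k):          # steps 2-3: try every (i, h) swap
            for h in points:
                if h in medoids:
                    continue
                trial = medoids[:i] + [h] + medoids[i + 1:]
                if cost(trial) < cost(medoids):   # TC_ih < 0
                    medoids = trial
                    changed = True
    return medoids

pts = [1, 2, 3, 20, 21, 22]
print(sorted(pam(pts, lambda a, b: abs(a - b), 2)))  # [2, 21]
```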
PAM

Partitioning Around Medoids (PAM) (K-Medoids)
Handles outliers well.
Ordering of input does not impact results.
Does not scale well.
Each cluster represented by one item, called the medoid.
Initial set of k medoids randomly chosen.
PAM Algorithm
DBSCAN

Density Based Spatial Clustering of Applications with Noise
Outliers will not affect creation of clusters.
Input
– MinPts – minimum number of points in a cluster
– Eps – for each point in the cluster there must be another point in it less than this distance away.
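A simplified sketch of how the two inputs are used (illustrative only; a full DBSCAN additionally distinguishes core, border, and noise points more carefully): a point whose Eps-neighbourhood holds at least MinPts points is a core point, clusters grow outward from core points, and unreachable points stay noise (-1).

```python
# Simplified DBSCAN-style clustering driven by Eps and MinPts.
def dbscan(points, eps, min_pts, dist):
    labels = {p: None for p in points}     # None = not yet visited
    cluster_id = 0
    for p in points:
        if labels[p] is not None:
            continue
        neigh = [q for q in points if dist(p, q) <= eps]
        if len(neigh) < min_pts:
            labels[p] = -1                 # noise (may become border later)
            continue
        cluster_id += 1                    # p is a core point: new cluster
        labels[p] = cluster_id
        seeds = list(neigh)
        while seeds:                       # expand the cluster outward
            q = seeds.pop()
            if labels[q] in (None, -1):
                labels[q] = cluster_id
                q_neigh = [r for r in points if dist(q, r) <= eps]
                if len(q_neigh) >= min_pts:   # q is itself a core point
                    seeds.extend(r for r in q_neigh
                                 if labels[r] in (None, -1))
    return labels

pts = [1.0, 1.5, 2.0, 10.0]
print(dbscan(pts, 1.0, 2, lambda a, b: abs(a - b)))
# {1.0: 1, 1.5: 1, 2.0: 1, 10.0: -1}
```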
Density Concepts