
Data Mining Algorithms

Clustering

M.Vijayalakshmi VESIT BE(IT) Data Mining

Clustering Outline

Goal: Provide an overview of the clustering problem and introduce some of the basic algorithms

Clustering Problem Overview
Clustering Techniques
– Hierarchical Algorithms
– Partitional Algorithms
– Genetic Algorithm
– Clustering Large Databases


General Applications of Clustering

Pattern Recognition
Spatial Data Analysis
– create thematic maps in GIS by clustering feature spaces
– detect spatial clusters and explain them in spatial data mining
Image Processing
Economic Science (especially market research)
WWW
– Document classification
– Cluster Weblog data to discover groups of similar access patterns


Examples of Clustering Applications

Marketing: Help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs

Land use: Identification of areas of similar land use in an earth observation database

Insurance: Identifying groups of motor insurance policy holders with a high average claim cost

City-planning: Identifying groups of houses according to their house type, value, and geographical location

Earthquake studies: Observed earthquake epicenters should be clustered along continent faults


Clustering vs. Classification

No prior knowledge
– Number of clusters
– Meaning of clusters
– Cluster results are dynamic

Unsupervised learning


Classification vs. Clustering

Classification: Supervised learning:

Learns a method for predicting the instance class from pre-labeled (classified) instances


Clustering

Unsupervised learning:

Finds “natural” grouping of instances given un-labeled data


Clustering Houses

Size Based

Geographic Distance Based


Clustering Methods

Many different methods and algorithms:

– For numeric and/or symbolic data

– Deterministic vs. probabilistic

– Exclusive vs. overlapping

– Hierarchical vs. flat

– Top-down vs. bottom-up


Clustering Issues

Outlier handling

Dynamic data

Interpreting results

Evaluating results

Number of clusters

Data to be used

Scalability


Impact of Outliers on Clustering


Clustering Evaluation

Manual inspection

Benchmarking on existing labels

Cluster quality measures
– distance measures
– high similarity within a cluster, low across clusters


Data Structures

Data matrix (n objects by p variables):

    [ x11 ... x1f ... x1p ]
    [ ...     ...     ... ]
    [ xi1 ... xif ... xip ]
    [ ...     ...     ... ]
    [ xn1 ... xnf ... xnp ]

Dissimilarity matrix (n by n, lower triangular):

    [ 0                          ]
    [ d(2,1)  0                  ]
    [ d(3,1)  d(3,2)  0          ]
    [ :       :       :          ]
    [ d(n,1)  d(n,2)  ...      0 ]


Measure the Quality of Clustering

Dissimilarity/Similarity metric: Similarity is expressed in terms of a distance function, which is typically metric: d(i, j)

There is a separate “quality” function that measures the “goodness” of a cluster.

The definitions of distance functions are usually very different for interval-scaled, boolean, categorical, ordinal and ratio variables.

Weights should be associated with different variables based on applications and data semantics.

It is hard to define “similar enough” or “good enough” – the answer is typically highly subjective.


Types of Data in Clustering Analysis

Interval-scaled variables

Binary variables

Nominal, ordinal, and ratio variables

Variables of mixed types


Similarity and Dissimilarity Between Objects

Distances are normally used to measure the similarity or dissimilarity between two data objects

Some popular ones include the Minkowski distance:

    d(i,j) = (|xi1 - xj1|^q + |xi2 - xj2|^q + ... + |xip - xjp|^q)^(1/q)

where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer

If q = 1, d is Manhattan distance:

    d(i,j) = |xi1 - xj1| + |xi2 - xj2| + ... + |xip - xjp|
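The Minkowski family can be sketched in a few lines of Python (function names here are illustrative, not from the slides):

```python
# A minimal sketch of the Minkowski distance between two p-dimensional
# points; q = 1 gives Manhattan distance, q = 2 gives Euclidean distance.
def minkowski(x, y, q=2):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

def manhattan(x, y):
    return minkowski(x, y, q=1)

def euclidean(x, y):
    return minkowski(x, y, q=2)

print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7.0
```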


Similarity and Dissimilarity Between Objects (Cont.)

If q = 2, d is Euclidean distance:

    d(i,j) = sqrt(|xi1 - xj1|^2 + |xi2 - xj2|^2 + ... + |xip - xjp|^2)

– Properties

    d(i,j) >= 0

    d(i,i) = 0

    d(i,j) = d(j,i)

    d(i,j) <= d(i,k) + d(k,j)


Binary Variables

A contingency table for binary data:

                     Object j
                     1      0      sum
    Object i    1    a      b      a+b
                0    c      d      c+d
              sum    a+c    b+d    p

Simple matching coefficient (invariant, if the binary variable is symmetric):

    d(i,j) = (b + c) / (a + b + c + d)

Jaccard coefficient (noninvariant if the binary variable is asymmetric):

    d(i,j) = (b + c) / (a + b + c)


Dissimilarity between Binary Variables

Example

    Name  Gender  Fever  Cough  Test-1  Test-2  Test-3  Test-4
    Jack  M       Y      N      P       N       N       N
    Mary  F       Y      N      P       N       P       N
    Jim   M       Y      P      N       N       N       N

– gender is a symmetric attribute
– the remaining attributes are asymmetric binary
– let the values Y and P be set to 1, and the value N be set to 0

    d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
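The slide's numbers can be reproduced with the Jaccard dissimilarity over the six asymmetric attributes (Y/P mapped to 1, N to 0; gender excluded because it is symmetric):

```python
# Jaccard dissimilarity (b+c)/(a+b+c) over asymmetric binary attributes:
# a = 1/1 matches, b = 1/0 mismatches, c = 0/1 mismatches.
def jaccard_dissim(x, y):
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    return (b + c) / (a + b + c)

# Fever, Cough, Test-1, Test-2, Test-3, Test-4 with Y/P -> 1, N -> 0
jack = (1, 0, 1, 0, 0, 0)
mary = (1, 0, 1, 0, 1, 0)
jim  = (1, 1, 0, 0, 0, 0)

print(round(jaccard_dissim(jack, mary), 2))  # 0.33
print(round(jaccard_dissim(jack, jim), 2))   # 0.67
print(round(jaccard_dissim(jim, mary), 2))   # 0.75
```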


Nominal Variables

A generalization of the binary variable in that it can take more than 2 states, e.g., red, yellow, blue, green

Method 1: Simple matching
– m: # of matches, p: total # of variables

    d(i,j) = (p - m) / p

Method 2: use a large number of binary variables
– creating a new binary variable for each of the M nominal states
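Method 1 is a one-liner in Python (the function name is illustrative):

```python
# Simple matching dissimilarity for nominal variables:
# d(i,j) = (p - m) / p, where m is the number of matching variables
# and p is the total number of variables.
def nominal_dissim(x, y):
    p = len(x)
    m = sum(1 for u, v in zip(x, y) if u == v)
    return (p - m) / p

# Two of three attributes match, so the dissimilarity is 1/3.
print(nominal_dissim(("red", "small", "round"), ("red", "large", "round")))
```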


Clustering ProblemClustering Problem

Given a database D={tGiven a database D={t11,t,t22,…,t,…,tnn} of tuples } of tuples and an integer value k, the and an integer value k, the Clustering Clustering ProblemProblem is to define a mapping is to define a mapping f:Df:D{1,..,k}{1,..,k} where each where each ttii is assigned to one cluster is assigned to one cluster KKjj, , 1<=j<=k.1<=j<=k.

A A ClusterCluster, K, Kjj,, contains precisely those contains precisely those tuples mapped to it.tuples mapped to it.Unlike classification problem, clusters are Unlike classification problem, clusters are not known a priori.not known a priori.


Types of Clustering

Hierarchical – Nested set of clusters created.
Partitional – One set of clusters created.
Incremental – Each element handled one at a time.
Simultaneous – All elements handled together.
Overlapping/Non-overlapping


Major Clustering Approaches

Partitioning algorithms: Construct various partitions and then evaluate them by some criterion

Hierarchy algorithms: Create a hierarchical decomposition of the set of data (or objects) using some criterion

Density-based: based on connectivity and density functions

Grid-based: based on a multiple-level granularity structure

Model-based: A model is hypothesized for each of the clusters and the idea is to find the best fit of the data to that model


Clustering Approaches

Clustering
– Hierarchical: Agglomerative, Divisive
– Partitional
– Categorical
– Large DB: Sampling, Compression


Cluster Parameters


Distance Between Clusters

Single Link: smallest distance between points
Complete Link: largest distance between points
Average Link: average distance between points
Centroid: distance between centroids
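The four inter-cluster distances can be sketched for clusters of 1-D points (the same idea extends to any point-wise distance; names are illustrative):

```python
# Inter-cluster distances for clusters given as lists of 1-D points.
def single_link(A, B):    # smallest pairwise distance
    return min(abs(a - b) for a in A for b in B)

def complete_link(A, B):  # largest pairwise distance
    return max(abs(a - b) for a in A for b in B)

def average_link(A, B):   # mean of all pairwise distances
    return sum(abs(a - b) for a in A for b in B) / (len(A) * len(B))

def centroid_dist(A, B):  # distance between cluster centroids
    return abs(sum(A) / len(A) - sum(B) / len(B))

print(single_link([1, 2], [4, 8]))    # 2
print(complete_link([1, 2], [4, 8]))  # 7
```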


Hierarchical Clustering

Clusters are created in levels, actually creating sets of clusters at each level.

Agglomerative
– Initially each item in its own cluster
– Iteratively clusters are merged together
– Bottom Up

Divisive
– Initially all items in one cluster
– Large clusters are successively divided
– Top Down


Hierarchical Clustering

Use distance matrix as clustering criteria. This method does not require the number of clusters k as an input, but needs a termination condition.

[Diagram: objects a, b, c, d, e. Agglomerative (AGNES) merges bottom-up over steps 0–4: {a,b}, {d,e}, {c,d,e}, then {a,b,c,d,e}. Divisive (DIANA) splits top-down in the reverse order, steps 4–0.]


Hierarchical Algorithms

Single Link

MST Single Link

Complete Link

Average Link


Dendrogram

A tree data structure which illustrates hierarchical clustering techniques.

Each level shows clusters for that level.
– Leaf – individual clusters
– Root – one cluster

A cluster at level i is the union of its children clusters at level i+1.


Levels of Clustering


Agglomerative Example

        A   B   C   D   E
    A   0   1   2   2   3
    B   1   0   2   4   3
    C   2   2   0   1   5
    D   2   4   1   0   3
    E   3   3   5   3   0

[Dendrogram over thresholds 1 to 5: at threshold 1, {A,B} and {C,D} merge; at 2, {A,B,C,D}; at 3, E joins to give {A,B,C,D,E}.]
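A short single-link agglomerative run on this distance matrix can be sketched as follows (helper names are illustrative, not from the slides):

```python
# Single-link agglomerative clustering on the slide's 5x5 distance matrix:
# at each step, merge the two clusters with the smallest minimum pairwise
# distance, stopping here at two clusters.
D = {("A", "B"): 1, ("A", "C"): 2, ("A", "D"): 2, ("A", "E"): 3,
     ("B", "C"): 2, ("B", "D"): 4, ("B", "E"): 3,
     ("C", "D"): 1, ("C", "E"): 5, ("D", "E"): 3}

def dist(x, y):
    return D.get((x, y)) or D.get((y, x)) or 0

def single_link_merge(clusters):
    # find the pair of clusters with the smallest single-link distance
    best = min(((ci, cj) for i, ci in enumerate(clusters)
                for cj in clusters[i + 1:]),
               key=lambda p: min(dist(x, y) for x in p[0] for y in p[1]))
    merged = best[0] | best[1]
    return [c for c in clusters if c not in best] + [merged]

clusters = [{"A"}, {"B"}, {"C"}, {"D"}, {"E"}]
while len(clusters) > 2:
    clusters = single_link_merge(clusters)
# At threshold 2 the two clusters are {A, B, C, D} and {E}.
```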


MST Example

        A   B   C   D   E
    A   0   1   2   2   3
    B   1   0   2   4   3
    C   2   2   0   1   5
    D   2   4   1   0   3
    E   3   3   5   3   0

[Graph: minimum spanning tree over A, B, C, D, E built from this distance matrix.]


Agglomerative Algorithm


Single Link

View all items with links (distances) between them.

Finds maximal connected components in this graph.

Two clusters are merged if there is at least one edge which connects them.

Uses threshold distances at each level.

Could be agglomerative or divisive.


MST Single Link Algorithm


Single Link Clustering


AGNES (Agglomerative Nesting)

Implemented in statistical analysis packages, e.g., Splus

Use the Single-Link method and the dissimilarity matrix.

Merge nodes that have the least dissimilarity

Go on in a non-descending fashion

Eventually all nodes belong to the same cluster


A Dendrogram Shows How the Clusters are Merged Hierarchically

Decompose data objects into several levels of nested partitioning (tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.


DIANA (Divisive Analysis)

Implemented in statistical analysis packages, e.g., Splus

Inverse order of AGNES

Eventually each node forms a cluster on its own


Partitional Clustering

Nonhierarchical

Creates clusters in one step as opposed to several steps.

Since only one set of clusters is output, the user normally has to input the desired number of clusters, k.

Usually deals with static sets.


Partitioning Algorithms

Partitioning method: Construct a partition of a database D of n objects into a set of k clusters

Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion
– Global optimal: exhaustively enumerate all partitions
– Heuristic methods: k-means and k-medoids algorithms
– k-means: Each cluster is represented by the center of the cluster
– k-medoids or PAM (Partition around medoids): Each cluster is represented by one of the objects in the cluster


Partitional Algorithms

MST

Squared Error

K-Means

Nearest Neighbor

PAM

BEA

GA


MST Algorithm


K-Means

Initial set of clusters randomly chosen.

Iteratively, items are moved among sets of clusters until the desired set is reached.

High degree of similarity among elements in a cluster is obtained.

Given a cluster Ki = {ti1, ti2, …, tim}, the cluster mean is mi = (1/m)(ti1 + … + tim)


K-Means Example

Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2

Randomly assign means: m1 = 3, m2 = 4

K1 = {2,3}, K2 = {4,10,12,20,30,11,25}; m1 = 2.5, m2 = 16

K1 = {2,3,4}, K2 = {10,12,20,30,11,25}; m1 = 3, m2 = 18

K1 = {2,3,4,10}, K2 = {12,20,30,11,25}; m1 = 4.75, m2 = 19.6

K1 = {2,3,4,10,11,12}, K2 = {20,30,25}; m1 = 7, m2 = 25

Stop, as the clusters with these means are the same.
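The run above can be reproduced with a small 1-D k-means (Lloyd's algorithm) sketch; the function name is illustrative, and for simplicity it assumes no cluster ever becomes empty:

```python
# 1-D k-means: repeatedly assign each point to its nearest mean, then
# recompute the means, until the means stop changing.
def kmeans_1d(points, means):
    while True:
        clusters = [[] for _ in means]
        for p in points:
            i = min(range(len(means)), key=lambda i: abs(p - means[i]))
            clusters[i].append(p)
        new_means = [sum(c) / len(c) for c in clusters]
        if new_means == means:
            return clusters, means
        means = new_means

clusters, means = kmeans_1d([2, 4, 10, 12, 3, 20, 30, 11, 25], [3, 4])
print(sorted(clusters[0]), sorted(clusters[1]))  # [2, 3, 4, 10, 11, 12] [20, 25, 30]
print(means)                                     # [7.0, 25.0]
```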


The K-Means Clustering Method

Given k, the k-means algorithm is implemented in 4 steps:
– Partition objects into k nonempty subsets
– Compute seed points as the centroids of the clusters of the current partition. The centroid is the center (mean point) of the cluster.
– Assign each object to the cluster with the nearest seed point.
– Go back to Step 2; stop when there are no more new assignments.


Comments on the K-Means Method

Strength
– Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.
– Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms

Weakness
– Applicable only when mean is defined; then what about categorical data?
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suitable to discover clusters with non-convex shapes


Nearest Neighbor

Items are iteratively merged into the existing clusters that are closest.

Incremental

Threshold, t, used to determine if items are added to existing clusters or a new cluster is created.
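A minimal sketch of this threshold rule for 1-D points (names are illustrative): each item joins the cluster containing its nearest already-placed item if that distance is within t, otherwise it starts a new cluster.

```python
# Incremental nearest-neighbour clustering with a distance threshold t.
def nn_cluster(points, t):
    clusters = [[points[0]]]
    for p in points[1:]:
        # cluster containing the nearest already-placed item
        best = min(clusters, key=lambda c: min(abs(p - q) for q in c))
        if min(abs(p - q) for q in best) <= t:
            best.append(p)
        else:
            clusters.append([p])
    return clusters

print(nn_cluster([2, 3, 4, 10, 11, 12, 25], t=3))
# [[2, 3, 4], [10, 11, 12], [25]]
```

Note the result depends on the input order, unlike PAM below.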


Variations of the K-Means Method

A few variants of the k-means which differ in
– Selection of the initial k means
– Dissimilarity calculations
– Strategies to calculate cluster means

Handling categorical data: k-modes
– Replacing means of clusters with modes
– Using new dissimilarity measures to deal with categorical objects
– Using a frequency-based method to update modes of clusters
– A mixture of categorical and numerical data: k-prototype method


The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters

PAM (Partitioning Around Medoids)
– starts from an initial set of medoids and iteratively replaces one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering
– PAM works effectively for small data sets, but does not scale well for large data sets

CLARA: applies PAM to random samples of the data

CLARANS: Randomized sampling; focusing + spatial data structure


PAM (Partitioning Around Medoids)

PAM - Use real objects to represent the clusters

– Select k representative objects arbitrarily

– For each pair of non-selected object h and selected object i, calculate the total swapping cost TCih

– For each pair of i and h,
    If TCih < 0, i is replaced by h
    Then assign each non-selected object to the most similar representative object

– Repeat steps 2-3 until there is no change
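The swap loop above can be sketched for 1-D points (a compact illustration with made-up names; real PAM implementations compute the swap cost incrementally rather than from scratch):

```python
# Total clustering cost: each point contributes its distance to the
# nearest medoid.
def total_cost(points, medoids):
    return sum(min(abs(p - m) for m in medoids) for p in points)

def pam(points, k):
    medoids = list(points[:k])           # arbitrary initial medoids
    while True:
        best_swap, best_delta = None, 0
        for m in medoids:
            for h in points:
                if h in medoids:
                    continue
                trial = [h if x == m else x for x in medoids]
                # TC_ih: change in total cost if m is swapped for h
                delta = total_cost(points, trial) - total_cost(points, medoids)
                if delta < best_delta:   # TC_ih < 0: the swap improves things
                    best_swap, best_delta = trial, delta
        if best_swap is None:            # no improving swap: done
            return sorted(medoids)
        medoids = best_swap

print(pam([2, 3, 4, 10, 11, 12], k=2))  # [3, 11]
```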


PAM

Partitioning Around Medoids (PAM) (K-Medoids)

Handles outliers well.
Ordering of input does not impact results.
Does not scale well.
Each cluster represented by one item, called the medoid.
Initial set of k medoids randomly chosen.


PAM


PAM Algorithm


DBSCAN

Density Based Spatial Clustering of Applications with Noise

Outliers will not affect creation of clusters.

Input
– MinPts – minimum number of points in a cluster
– Eps – for each point in a cluster there must be another point in it less than this distance away.
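A minimal DBSCAN sketch for 1-D points, using the two inputs above (names are illustrative; real implementations use spatial indexes to find neighbours efficiently):

```python
# A point is a core point if at least min_pts points lie within eps of it.
# Clusters grow outward from core points; points reachable from no core
# point are labelled noise (-1).
def dbscan(points, eps, min_pts):
    labels = {}
    cid = 0
    def neighbours(p):
        return [q for q in points if abs(p - q) <= eps]
    for p in points:
        if p in labels:
            continue
        nbrs = neighbours(p)
        if len(nbrs) < min_pts:
            labels[p] = -1                   # tentatively noise
            continue
        cid += 1                             # start a new cluster from core p
        labels[p] = cid
        seeds = [q for q in nbrs if q != p]
        while seeds:
            q = seeds.pop()
            if labels.get(q, -1) == -1:      # unvisited, or noise turned border
                labels[q] = cid
                qn = neighbours(q)
                if len(qn) >= min_pts:       # q is core: keep expanding
                    seeds.extend(x for x in qn if labels.get(x, -1) == -1)
    return labels

labels = dbscan([1, 2, 3, 10, 11, 12, 25], eps=2, min_pts=3)
```

Here {1,2,3} and {10,11,12} form two clusters, while the isolated point 25 is labelled noise.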


Density Concepts


Comparison of Clustering Techniques