Data Mining
Cluster Analysis
Prithwis Mukerjee, Ph.D.
Prithwis Mukerjee 2
If we were using “Classification”
Name        Eggs  Pouch  Flies  Feathers  Class
Cockatoo    Yes   No     Yes    Yes       Bird
Dugong      No    No     No     No        Mammal
Echidna     Yes   Yes    No     No        Marsupial
Emu         Yes   No     No     Yes       Bird
Kangaroo    No    Yes    No     No        Marsupial
Koala       No    Yes    No     No        Marsupial
Kookaburra  Yes   No     Yes    Yes       Bird
Owl         Yes   No     Yes    Yes       Bird
Penguin     Yes   No     No     Yes       Bird
Platypus    Yes   No     No     No        Mammal
Possum      No    Yes    No     No        Marsupial
Wombat      No    Yes    No     No        Marsupial

We would be looking at data like this ...
But in “Cluster Analysis” we do NOT have previous knowledge or expertise to define these classes !!

Name        Eggs  Pouch  Flies  Feathers  Class
Cockatoo    Yes   No     Yes    Yes       ?
Dugong      No    No     No     No        ?
Echidna     Yes   Yes    No     No        ?
Emu         Yes   No     No     Yes       ?
Kangaroo    No    Yes    No     No        ?
Koala       No    Yes    No     No        ?
Kookaburra  Yes   No     Yes    Yes       ?
Owl         Yes   No     Yes    Yes       ?
Penguin     Yes   No     No     Yes       ?
Platypus    Yes   No     No     No        ?
Possum      No    Yes    No     No        ?
Wombat      No    Yes    No     No        ?

We have to look at the attributes alone and somehow group the data into clusters.
What is a cluster ?
- A cluster contains objects that are “similar”
- There is no unique definition of similarity; it depends on the situation
  - Elements of the periodic table can be clustered along physical or chemical properties
  - Customers can be clustered as
    - High value, high “pain” or high “maintenance”, high volume, ...
    - Risky, credit worthy, suspicious, ...
- So similarity will depend on
  - The choice of attributes of an object
  - A credible definition of “similarity” of these attributes
  - The “distance” between two objects based on the values of the respective attributes
What is “distance” between two objects ?

- This depends on the nature of the attribute
- Quantitative attributes are the easiest and most common
  - Height, weight, value, price, score, ...
  - Distance can be the difference between values
- Binary attributes are also common, but not as easy
  - Gender, marital status, employment status, ...
  - Distance can be the ratio of the number of attributes with different values to the total number of such attributes
- Qualitative nominal attributes are similar to binary attributes, but can take more than two values, which are NOT ranked
  - Religion, complexion, colour of hair, ...
- Qualitative ordinal attributes can be ranked in some order
  - Size (S, M, L, XL), Grade (A, B, C, D)
  - These can be converted to a numerical scale
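Converting ordinal labels to a numerical scale can be sketched as below. The mappings and the `encode_ordinal` helper are illustrative assumptions, not part of the deck; any order-preserving scale would do.

```python
# Ordinal attributes can be ranked, so an order-preserving numeric
# encoding lets us treat them like quantitative attributes.
SIZE_SCALE = {"S": 1, "M": 2, "L": 3, "XL": 4}
GRADE_SCALE = {"A": 4, "B": 3, "C": 2, "D": 1}

def encode_ordinal(value, scale):
    """Convert an ordinal label to its numeric rank."""
    return scale[value]

# Distance between two ordinal values is now just a numeric difference
d = abs(encode_ordinal("XL", SIZE_SCALE) - encode_ordinal("S", SIZE_SCALE))
```

Once encoded, ordinal attributes can use the same difference-based distances as quantitative ones.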
“Distance” between two objects
There are many ways to calculate distance but ...
- All definitions of distance must have the following properties:
  - Distance is always non-negative
  - Distance from an object X (or point X) to itself must be zero
  - Distance(X, Y) ≤ Distance(X, Z) + Distance(Z, Y) (the triangle inequality)
  - Distance(X, Y) = Distance(Y, X) (symmetry)
- Care must be taken in choosing:
  - Attributes: use the most descriptive or discriminatory attributes
  - Scale of values: it may make sense to “normalise” all distance metrics using the mean and standard deviation, to guard against one attribute dominating the others
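The mean-and-standard-deviation normalisation mentioned above can be sketched as follows. The `standardise` helper is an illustrative assumption; the ages used are the student ages from the worked example later in the deck.

```python
import statistics

def standardise(values):
    """Scale attribute values to zero mean and unit standard deviation,
    so no single attribute dominates the distance calculation."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [(v - mu) / sigma for v in values]

# Student ages from the worked example; the outlier (47) stays the
# largest value but no longer dwarfs the marks attributes.
ages = [18, 18, 23, 20, 22, 19, 20, 21, 19, 47]
z = standardise(ages)
```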
Finally : Distance
- Euclidean distance: D(x,y) = √( Σ (xi − yi)² )
  - The L2 norm of the difference vector
- Manhattan distance: D(x,y) = Σ |xi − yi|
  - The L1 norm of the difference vector; yields similar results
- Chebychev distance: D(x,y) = max |xi − yi|
  - Also called the L∞ norm
- Categorical data distance: D(x,y) = (number of attributes where xi ≠ yi) / N
  - Where N is the number of categorical attributes, so identical objects are at distance zero
Clustering : Partitioning Method
- Results in a single level of partitioning
  - Clusters are NOT nested inside other clusters
- Given n objects, define k ≤ n clusters
  - Each cluster has at least one object
  - Each object belongs to only one cluster
- Objects are assigned to clusters iteratively
  - Objects may be reassigned to another cluster during the process of clustering
- The number of clusters is defined up front
- The aim is to achieve
  - LOW variance WITHIN a cluster
  - HIGH variance ACROSS different clusters
Partitioning : K-means / K-median method
- Set the number of clusters = k
- Pick k seeds as 'centroids' of each cluster
  - This may be done randomly OR intelligently
- Compute the distance of each object from each centroid
  - Euclidean: for K-means
  - Manhattan: for K-median
- Allocate each object to a cluster depending on its proximity to the nearest centroid
- Iterate:
  - Re-calculate the centroid of each cluster, based on the objects allocated to it
  - Re-compute the distance of each object from each centroid
  - Re-allocate objects to clusters based on the new centroids
- Stop IF the new clusters have the same members as the old clusters, ELSE continue iterating
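The steps above can be sketched as a single loop. This is a minimal illustration, not the deck's own code: the `cluster` helper is my name, and it pairs Manhattan distance with mean centroids, matching the worked example that follows.

```python
def manhattan(x, y):
    """L1 distance between two attribute vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

def cluster(objects, seeds, distance):
    """Assign each object to its nearest centroid, recompute centroids
    as attribute-wise means, and repeat until membership is stable."""
    centroids = [list(s) for s in seeds]
    assignment = None
    while True:
        new_assignment = [
            min(range(len(centroids)), key=lambda k: distance(obj, centroids[k]))
            for obj in objects
        ]
        if new_assignment == assignment:   # same members as before: stop
            return assignment, centroids
        assignment = new_assignment
        for k in range(len(centroids)):
            members = [o for o, a in zip(objects, assignment) if a == k]
            if members:                    # keep old centroid if a cluster empties
                centroids[k] = [sum(col) / len(members) for col in zip(*members)]

# Student data from the deck; seeds are the first three students
students = [
    (18, 73, 75, 57), (18, 79, 85, 75), (23, 70, 70, 52), (20, 55, 55, 55),
    (22, 85, 86, 87), (19, 91, 90, 89), (20, 70, 65, 60), (21, 53, 56, 59),
    (19, 82, 82, 60), (47, 75, 76, 77),
]
labels, centroids = cluster(students, seeds=students[:3], distance=manhattan)
```

On this data the loop converges after the second pass, reproducing the clusters derived step by step on the next slides.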
Let us try to cluster this data ...
Our initial centroids are the first three students, though they could have been any other points.

Student   Age  Marks 1  Marks 2  Marks 3
s1        18   73       75       57
s2        18   79       85       75
s3        23   70       70       52
s4        20   55       55       55
s5        22   85       86       87
s6        19   91       90       89
s7        20   70       65       60
s8        21   53       56       59
s9        19   82       82       60
s10       47   75       76       77

Centroid  Age  Marks 1  Marks 2  Marks 3
C1        18   73       75       57
C2        18   79       85       75
C3        23   70       70       52
We assign each student to a cluster
Based on closest distance from centroid
We note that:

  C1 = { s1, s9 }
  C2 = { s2, s5, s6, s10 }
  C3 = { s3, s4, s7, s8 }

Centroid  Age    Marks 1  Marks 2  Marks 3
C1        18.00  73.00    75.00    57.00
C2        18.00  79.00    85.00    75.00
C3        23.00  70.00    70.00    52.00

                                            Distance from centroid   Assigned
Student   Age    Marks 1  Marks 2  Marks 3  C1      C2      C3       to cluster
s1        18.00  73.00    75.00    57.00    0.00    34.00   18.00    C1
s2        18.00  79.00    85.00    75.00    34.00   0.00    52.00    C2
s3        23.00  70.00    70.00    52.00    18.00   52.00   0.00     C3
s4        20.00  55.00    55.00    55.00    42.00   76.00   36.00    C3
s5        22.00  85.00    86.00    87.00    57.00   23.00   67.00    C2
s6        19.00  91.00    90.00    89.00    66.00   32.00   82.00    C2
s7        20.00  70.00    65.00    60.00    18.00   46.00   16.00    C3
s8        21.00  53.00    56.00    59.00    44.00   74.00   40.00    C3
s9        19.00  82.00    82.00    60.00    20.00   22.00   36.00    C1
s10       47.00  75.00    76.00    77.00    52.00   44.00   60.00    C2
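The distances in this table are Manhattan (L1) distances, consistent with the K-median variant mentioned earlier. A minimal sketch of the assignment step (the helper names are my own):

```python
def manhattan(x, y):
    """L1 distance between two attribute vectors."""
    return sum(abs(a - b) for a, b in zip(x, y))

# (age, marks1, marks2, marks3) for each student, as in the slide
students = {
    "s1": (18, 73, 75, 57), "s2": (18, 79, 85, 75), "s3": (23, 70, 70, 52),
    "s4": (20, 55, 55, 55), "s5": (22, 85, 86, 87), "s6": (19, 91, 90, 89),
    "s7": (20, 70, 65, 60), "s8": (21, 53, 56, 59), "s9": (19, 82, 82, 60),
    "s10": (47, 75, 76, 77),
}
# Seed centroids are the first three students
centroids = {"C1": students["s1"], "C2": students["s2"], "C3": students["s3"]}

# Assign each student to the cluster whose centroid is nearest
assignment = {name: min(centroids, key=lambda c: manhattan(vec, centroids[c]))
              for name, vec in students.items()}
```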
Now we re-calculate the centroid of each cluster, based on the values of the attributes of the members of that cluster.

Centroid  Age    Marks 1  Marks 2  Marks 3
C1        18.00  73.00    75.00    57.00
C2        18.00  79.00    85.00    75.00
C3        23.00  70.00    70.00    52.00

New C1    18.50  77.50    78.50    58.50
New C2    26.50  82.50    84.30    82.00
New C3    21.00  62.00    61.50    56.50
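Each new centroid is simply the attribute-wise mean of its cluster's members. A quick check, using the memberships from the first assignment (the `recompute_centroid` helper is my own):

```python
# Members of each cluster after the first assignment:
# C1 = {s1, s9}, C2 = {s2, s5, s6, s10}, C3 = {s3, s4, s7, s8}
clusters = {
    "C1": [(18, 73, 75, 57), (19, 82, 82, 60)],
    "C2": [(18, 79, 85, 75), (22, 85, 86, 87), (19, 91, 90, 89), (47, 75, 76, 77)],
    "C3": [(23, 70, 70, 52), (20, 55, 55, 55), (20, 70, 65, 60), (21, 53, 56, 59)],
}

def recompute_centroid(members):
    """Attribute-wise mean of the member vectors."""
    return [sum(col) / len(members) for col in zip(*members)]

new_centroids = {name: recompute_centroid(m) for name, m in clusters.items()}
```

Note that New C2 Marks 2 comes out as 84.25, shown rounded in the table.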
Second Iteration of Assignments
Based on closest distance from new centroids ..
The sets are ... the same as the old sets !!

  C1 = { s1, s9 }
  C2 = { s2, s5, s6, s10 }
  C3 = { s3, s4, s7, s8 }

Centroid  Age    Marks 1  Marks 2  Marks 3
C1        18.50  77.50    78.50    58.50
C2        26.50  82.50    84.30    82.00
C3        21.00  62.00    61.50    56.50

                                            Distance from centroid   Assigned
Student   Age    Marks 1  Marks 2  Marks 3  C1      C2      C3       to cluster
s1        18.00  73.00    75.00    57.00    10.00   52.30   28.00    C1
s2        18.00  79.00    85.00    75.00    25.00   19.80   62.00    C2
s3        23.00  70.00    70.00    52.00    27.00   60.30   23.00    C3
s4        20.00  55.00    55.00    55.00    51.00   90.30   16.00    C3
s5        22.00  85.00    86.00    87.00    47.00   13.80   79.00    C2
s6        19.00  91.00    90.00    89.00    56.00   28.80   92.00    C2
s7        20.00  70.00    65.00    60.00    24.00   60.30   16.00    C3
s8        21.00  53.00    56.00    59.00    50.00   86.30   17.00    C3
s9        19.00  82.00    82.00    60.00    10.00   32.30   46.00    C1
s10       47.00  75.00    76.00    77.00    52.00   41.30   74.00    C2

Since the cluster memberships have not changed: STOP
Some thoughts ....
- How good is the clustering ?
  - Within-cluster variances (diagonal) are low; across-cluster variances are higher
  - Hence the clustering is good

    Average Euclidean distance of each cluster's members from each centroid:

              C1     C2     C3
    C1        5.9    26.5   23.3
    C2        29.5   14.3   42.6
    C3        23.9   41.0   10.7

- Can it be improved ?
  - The clustering was guided by the Marks, not so much by Age
  - We might consider scaling all the attributes: x'i = (xi − μx) / σx
- Is this the only way to create clusters ? NO
  - We could start with a different set of seeds and might end up with another set of clusters
  - K-Means is a “hill climbing” algorithm that finds local optima, NOT the global optimum
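The matrix above can be reproduced by averaging the Euclidean distance of each cluster's members (rows) from each centroid (columns); the sketch below does this under that assumption, using the centroids from the second iteration.

```python
def euclidean(x, y):
    """L2 distance between two attribute vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

# Cluster members: C1 = {s1, s9}, C2 = {s2, s5, s6, s10}, C3 = {s3, s4, s7, s8}
clusters = {
    "C1": [(18, 73, 75, 57), (19, 82, 82, 60)],
    "C2": [(18, 79, 85, 75), (22, 85, 86, 87), (19, 91, 90, 89), (47, 75, 76, 77)],
    "C3": [(23, 70, 70, 52), (20, 55, 55, 55), (20, 70, 65, 60), (21, 53, 56, 59)],
}
centroids = {"C1": (18.5, 77.5, 78.5, 58.5),
             "C2": (26.5, 82.5, 84.25, 82.0),   # 84.25 appears as 84.3 in the slide
             "C3": (21.0, 62.0, 61.5, 56.5)}

# Rows: clusters; columns: centroids. Small diagonal values mean tight
# clusters; large off-diagonal values mean well-separated clusters.
matrix = {ci: {cj: round(sum(euclidean(m, centroids[cj]) for m in members) / len(members), 1)
               for cj in centroids}
          for ci, members in clusters.items()}
```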