88
Cluster Analysis: Basic Concepts and Algorithms Data Warehousing and Mining Lecture 10 by Hossen Asiful Mustafa

Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

  • Upload
    vuquynh

  • View
    237

  • Download
    4

Embed Size (px)

Citation preview

Page 1: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Cluster Analysis: Basic Concepts and Algorithms

Data Warehousing and Mining

Lecture 10

by

Hossen Asiful Mustafa

Page 2: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

What is Cluster Analysis? • Finding groups of objects such that the objects in a group will

be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Page 3: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

What is Cluster Analysis? • Finding groups of objects such that the objects in a group will

be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Page 4: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

What is Cluster Analysis? • Finding groups of objects such that the objects in a group will

be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Intra-cluster distances are

minimized

Page 5: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

What is Cluster Analysis? • Finding groups of objects such that the objects in a group will

be similar (or related) to one another and different from (or unrelated to) the objects in other groups

Inter-cluster distances are maximized

Intra-cluster distances are

minimized

Page 6: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Applications of Cluster Analysis

• Understanding – Group related documents for

browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations

• Summarization – Reduce the size of large data

sets

Discovered Clusters Industry Group

1 Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN,

Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN,

Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,

Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN,

Sun-DOWN

Technology1-DOWN

2 Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN,

ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,

Computer-Assoc-DOWN,Circuit-City-DOWN,

Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,

Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN

Technology2-DOWN

3 Fannie-Mae-DOWN,Fed-Home-Loan-DOWN,

MBNA-Corp-DOWN,Morgan-Stanley-DOWN

Financial-DOWN

4 Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,

Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,

Schlumberger-UP

Oil-UP

Clustering precipitation in Australia

Page 7: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

What is not Cluster Analysis?

• Supervised classification – Have class label information

• Simple segmentation – Dividing students into different registration groups alphabetically,

by last name

• Results of a query – Groupings are a result of an external specification

• Graph partitioning – Some mutual relevance and synergy, but areas are not identical

Page 8: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Notion of a Cluster can be Ambiguous

How many clusters?

Page 9: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Notion of a Cluster can be Ambiguous

How many clusters?

Two Clusters

Page 10: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Notion of a Cluster can be Ambiguous

How many clusters?

Two Clusters

Six Clusters

Page 11: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Notion of a Cluster can be Ambiguous

How many clusters?

Four Clusters Two Clusters

Six Clusters

Page 12: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Types of Clusterings

• A clustering is a set of clusters

• Important distinction between hierarchical and partitional sets of clusters

• Partitional Clustering – A division data objects into non-overlapping subsets (clusters) such

that each data object is in exactly one subset

• Hierarchical clustering – A set of nested clusters organized as a hierarchical tree

Page 13: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Partitional Clustering

Original Points

Page 14: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Partitional Clustering

Original Points A Partitional Clustering

Page 15: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Hierarchical Clustering

p4

p1p3

p2

p4

p1 p3

p2

p4p1 p2 p3

p4p1 p2 p3

Traditional Hierarchical

Clustering

Non-traditional Hierarchical

Clustering

Non-traditional Dendrogram

Traditional Dendrogram

Page 16: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Types of Clusters

• Well-separated clusters

• Center-based clusters

• Contiguous clusters

• Density-based clusters

• Property or Conceptual

• Described by an Objective Function

Page 17: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Types of Clusters: Well-Separated

• Well-Separated Clusters: – A cluster is a set of points such that any point in a cluster is closer

(or more similar) to every other point in the cluster than to any point not in the cluster.

3 well-separated clusters

Page 18: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Types of Clusters: Center-Based

• Center-based – A cluster is a set of objects such that an object in a cluster is closer

(more similar) to the “center” of a cluster, than to the center of any other cluster

– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster

4 center-based clusters

Page 19: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Types of Clusters: Contiguity-Based

• Contiguous Cluster (Nearest neighbor or Transitive) – A cluster is a set of points such that a point in a cluster is closer (or

more similar) to one or more other points in the cluster than to any point not in the cluster.

8 contiguous clusters

Page 20: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Types of Clusters: Density-Based

• Density-based – A cluster is a dense region of points, which is separated by low-

density regions, from other regions of high density.

– Used when the clusters are irregular or intertwined, and when noise and outliers are present.

6 density-based clusters

Page 21: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Types of Clusters: Conceptual Clusters

• Shared Property or Conceptual Clusters – Finds clusters that share some common property or represent a

particular concept.

.

2 Overlapping Circles

Page 22: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Clustering Algorithms

• K-means and its variants

• Hierarchical clustering

• Density-based clustering

Page 23: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

K-means Clustering

• Partitional clustering approach • Each cluster is associated with a centroid (center point) • Each point is assigned to the cluster with the closest

centroid • Number of clusters, K, must be specified • The basic algorithm is very simple

Page 24: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

K-means Clustering – Details

• Initial centroids are often chosen randomly. – Clusters produced vary from one run to another.

• The centroid is (typically) the mean of the points in the cluster.

• ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.

• K-means will converge for common similarity measures mentioned above.

• Most of the convergence happens in the first few iterations. – Often the stopping condition is changed to ‘Until relatively few points

change clusters’

• Complexity is O( n * K * I * d ) – n = number of points, K = number of clusters,

I = number of iterations, d = number of attributes

Page 25: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

K-means Clustering Example

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

Page 26: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

K-means Clustering Example

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

Page 27: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

K-means Clustering Example

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 3

Page 28: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

K-means Clustering Example

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 4

Page 29: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

K-means Clustering Example

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 5

Page 30: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

K-means Clustering Example

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 6

Page 31: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

K-means Clustering Example

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 6

Page 32: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Evaluating K-means Clusters

K

i Cx

i

i

xmdistSSE1

2 ),(

• Most common measure is Sum of Squared Error (SSE) – For each point, the error is the distance to the nearest cluster

– To get SSE, we square these errors and sum them.

– x is a data point in cluster Ci and mi is the representative point for cluster Ci • can show that mi corresponds to the center (mean) of the cluster

– Given two clusters, we can choose the one with the smallest error

– One easy way to reduce SSE is to increase K, the number of clusters • A good clustering with smaller K can have a lower SSE than a poor

clustering with higher K

Page 33: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Two different K-means Clusterings

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Original Points

Page 34: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Two different K-means Clusterings

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Optimal Clustering

Original Points

Page 35: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Two different K-means Clusterings

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Sub-optimal Clustering

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Optimal Clustering

Original Points

Page 36: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Importance of Choosing Initial Centroids …

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

Page 37: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Importance of Choosing Initial Centroids …

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

Page 38: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Importance of Choosing Initial Centroids …

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 3

Page 39: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Importance of Choosing Initial Centroids …

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 4

Page 40: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Importance of Choosing Initial Centroids …

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 5

Page 41: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Importance of Choosing Initial Centroids …

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

xy

Iteration 5

Page 42: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Problems with Selecting Initial Points

• If there are K ‘real’ clusters then the chance of selecting one centroid from each cluster is small. – Chance is relatively small when K is large

– If clusters are the same size, n, then

– For example, if K = 10, then probability = 10!/1010 = 0.00036

– Sometimes the initial centroids will readjust themselves in ‘right’ way, and sometimes they don’t

– Consider an example of five pairs of clusters

Page 43: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

Starting with two initial centroids in one cluster of each pair of clusters

Page 44: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

Starting with two initial centroids in one cluster of each pair of clusters

Page 45: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

Starting with two initial centroids in one cluster of each pair of clusters

Page 46: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

Page 47: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

Page 48: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while other

have only one.

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

Page 49: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while other

have only one.

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

Page 50: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while other

have only one.

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

Page 51: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while other

have only one.

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Page 52: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while other

have only one.

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Page 53: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Solutions to Initial Centroids Problem

• Multiple runs – Helps, but probability is not on your side

• Sample and use hierarchical clustering to determine initial centroids – Sample size and K needs to be small

• Select first centroid and then select an point that is farthest away from existing centroids – Computationally expensive

• Bisecting K-means – Not as susceptible to initialization issues

Page 54: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Handling Empty Clusters

• Basic K-means algorithm can yield empty clusters

• Need to choose a replacement centroid

• Several strategies

– Choose a point that is farthest away from any current centrold

– Choose the point that contributes most to SSE

– Choose a point from the cluster with the highest SSE

– If there are several empty clusters, the above can be repeated several times.

Page 55: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Bisecting K-means

• Bisecting K-means algorithm – Variant of K-means that can produce a partitional or a hierarchical

clustering

Page 56: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Bisecting K-means Example

Page 57: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Bisecting K-means Example

Page 58: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Bisecting K-means Example

Page 59: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Bisecting K-means Example

Page 60: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Bisecting K-means Example

Page 61: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Bisecting K-means Example

Page 62: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Bisecting K-means Example

Page 63: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Bisecting K-means Example

Page 64: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Bisecting K-means Example

Page 65: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Bisecting K-means Example

Page 66: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Limitations of K-means

• K-means has problems when clusters are of differing

– Sizes

– Densities

– Non-globular shapes

• K-means has problems when the data contains outliers.

Page 67: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Limitations of K-means: Differing Sizes

Original Points

Page 68: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)

Page 69: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)

Page 70: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Limitations of K-means: Differing Density

Original Points

Page 71: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)

Page 72: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)

Page 73: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Limitations of K-means: Non-globular Shapes

Original Points

Page 74: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)

Page 75: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)

Page 76: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters.

Find parts of clusters, but need to put together.

Page 77: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Overcoming K-means Limitations

Original Points K-means Clusters

Page 78: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

Overcoming K-means Limitations

Original Points K-means Clusters

Page 79: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

DBSCAN

• DBSCAN is a density-based algorithm. – Density = number of points within a specified radius (ε: Eps)

– A point is a core point if it has more than a specified number of

points (MinPts) within Eps • These are points that are at the interior of a cluster

– A border point has fewer than MinPts within Eps, but is in the neighborhood of a core point

– A noise point is any point that is not a core point or a border point.

Page 80: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

DBSCAN: Core, Border, and Noise Points

Page 81: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

DBSCAN Algorithm

• Eliminate noise points

• Perform clustering on the remaining points

Page 82: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

DBSCAN: Core, Border and Noise Points

Original Points Point types: core,

border and noise

Eps = 10, MinPts = 4

Page 83: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

When DBSCAN Works Well

Original Points

Page 84: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

When DBSCAN Works Well

Original Points Clusters

Page 85: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

When DBSCAN Works Well

Original Points Clusters

• Resistant to Noise

• Can handle clusters of different shapes and sizes

Page 86: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

When DBSCAN Does NOT Work Well

Original Points

(MinPts=4, Eps=9.75).

(MinPts=4, Eps=9.92)

Page 87: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

When DBSCAN Does NOT Work Well

Original Points

(MinPts=4, Eps=9.75).

(MinPts=4, Eps=9.92)

• Varying densities

• High-dimensional data

Page 88: Cluster Analysis: Basic Concepts and Algorithmsteacher.buet.ac.bd/hossen_mustafa/download/ICT6522_Lecture-10.pdf · Cluster Analysis: Basic Concepts and Algorithms Data Warehousing

DBSCAN: Determining EPS and MinPts

• Idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance

• Noise points have the kth nearest neighbor at farther distance

• So, plot sorted distance of every point to its kth nearest neighbor

Plot for k = 4

Eps = 10

MinPts = k = 4