65
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 7 Introduction to Data Mining, 2 nd Edition by Tan, Steinbach, Karpatne, Kumar …with some additions by your instructor

Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

  • Upload
    others

  • View
    5

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Data MiningCluster Analysis: Basic Concepts

and Algorithms

Lecture Notes for Chapter 7

Introduction to Data Mining, 2nd Edition

by

Tan, Steinbach, Karpatne, Kumar

…with some additions by your instructor

Page 2: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Classification and Clustering

Preliminaries

Page 3: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Classification problem

• What we have• A set of objects, each of them

described by some features• people described by age, gender,

height, etc.• bank transactions described by type,

amount, time, etc.

• What we want to do• Associate the objects of a set to a

class, taken from a predefined list• “good customer” vs. “churner”• “normal transaction” vs. “fraudulent”• “low risk patient” vs. “risky”

?

?

?

?

?

Feature 1(e.g. Age)

Fea

ture

2(e

.g. I

nco

me)

15k€

50y

35k€

60y

Page 4: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Classification problem

• What we know• No domain knowledge or theory

• Only examples: Training Set• Subset of labelled objects

• What we can do• Learn from examples

• Make inferences about the other objects

Page 5: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Classify by similarity

• K-Nearest Neighbors• Decide label based on K most similar examples

K=3

Page 6: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Build a model

• Example 1: linear separation line

Page 7: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Build a model

• Example 2: Support Vector Machine (linear)

Page 8: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Build a model

• Example 3: Non-linear separation line

Page 9: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Build a model

• Decision Trees

Income > 15k€ ?

Age > 50y ?

Age

Inco

me

yes no

yes no

Page 10: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Clustering

• Classification starts from predefined labels and some examples to learn

Page 11: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Clustering

• What if no labels are known?• We might lack examples• Labels might actually not exist at all…

Page 12: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Clustering

• Objective: find structure in the data

• Group objects into clusters of similar entities

Page 13: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Clustering: K-means (family)

• Find k subgroups that form compact and well-separated clusters

K=3

Cluster compactness

Cluster separation

Page 14: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Clustering: K-means (family)

• Output 1: a partitioning of the initial set of objects

K=3

Page 15: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Clustering: K-means (family)

• Output 2: K representative objects (centroids)

• Centroid = average profile of the objects in the cluster

K=3

• Avg. age• Avg. weight• Avg. income• Avg. .n children• …

Page 16: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Clustering: hierarchical approaches

• Sometimes we can have (or desire) multiple levels of aggregation

Page 17: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Clustering: hierarchical approaches

• Sometimes we can have (or desire) multiple levels of aggregation

Dendogram

Page 18: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

The many notions of «cluster»

Basics

Page 19: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 19

What is Cluster Analysis?

Finding groups of objects such that the objects in a group

will be similar (or related) to one another and different

from (or unrelated to) the objects in other groups

Inter-cluster distances are maximized

Intra-cluster distances are

minimized

Page 20: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 20

Applications of Cluster Analysis

Understanding

– Group related documents

for browsing, group genes

and proteins that have

similar functionality, or

group stocks with similar

price fluctuations

Summarization

– Reduce the size of large

data sets

Discovered Clusters Industry Group

1 Applied-Matl-DOWN,Bay-Network-Down,3-COM-DOWN,

Cabletron-Sys-DOWN,CISCO-DOWN,HP-DOWN,

DSC-Comm-DOWN,INTEL-DOWN,LSI-Logic-DOWN,

Micron-Tech-DOWN,Texas-Inst-Down,Tellabs-Inc-Down,

Natl-Semiconduct-DOWN,Oracl-DOWN,SGI-DOWN,

Sun-DOWN

Technology1-DOWN

2 Apple-Comp-DOWN,Autodesk-DOWN,DEC-DOWN,

ADV-Micro-Device-DOWN,Andrew-Corp-DOWN,

Computer-Assoc-DOWN,Circuit-City-DOWN,

Compaq-DOWN, EMC-Corp-DOWN, Gen-Inst-DOWN,

Motorola-DOWN,Microsoft-DOWN,Scientific-Atl-DOWN

Technology2-DOWN

3 Fannie-Mae-DOWN,Fed-Home-Loan-DOWN,

MBNA-Corp-DOWN,Morgan-Stanley-DOWN

Financial-DOWN

4 Baker-Hughes-UP,Dresser-Inds-UP,Halliburton-HLD-UP,

Louisiana-Land-UP,Phillips-Petro-UP,Unocal-UP,

Schlumberger-UP

Oil-UP

Clustering precipitation

in Australia

Page 21: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 21

What is not Cluster Analysis?

Simple segmentation– Dividing students into different registration groups

alphabetically, by last name

Results of a query– Groupings are a result of an external specification

– Clustering is a grouping of objects based on the data

Supervised classification– Have class label information

Association Analysis– Local vs. global connections

Page 22: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 22

Notion of a Cluster can be Ambiguous

How many clusters?

Four ClustersTwo Clusters

Six Clusters

Page 23: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 23

Types of Clusterings

A clustering is a set of clusters

Important distinction between hierarchical and partitional sets of clusters

Partitional Clustering– A division of data objects into non-overlapping subsets

(clusters) such that each data object is in exactly one subset

Hierarchical clustering– A set of nested clusters organized as a hierarchical tree

Page 24: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 24

Partitional Clustering

Original Points A Partitional Clustering

Page 25: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 25

Hierarchical Clustering

p4

p1 p3

p2

p4p1 p2 p3

Traditional Hierarchical Clustering Traditional Dendrogram

Page 26: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 26

Other Distinctions Between Sets of Clusters

Exclusive versus non-exclusive– In non-exclusive clusterings, points may belong to multiple

clusters.

– Can represent multiple classes or ‘border’ points

Fuzzy versus non-fuzzy– In fuzzy clustering, a point belongs to every cluster with some

weight between 0 and 1

– Weights must sum to 1

– Probabilistic clustering has similar characteristics

Partial versus complete– In some cases, we only want to cluster some of the data

Heterogeneous versus homogeneous– Clusters of widely different sizes, shapes, and densities

Page 27: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 27

Types of Clusters

Well-separated clusters

Center-based clusters

Contiguous clusters

Density-based clusters

Property or Conceptual

Described by an Objective Function

Page 28: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 28

Types of Clusters: Well-Separated

Well-Separated Clusters: – A cluster is a set of points such that any point in a cluster is

closer (or more similar) to every other point in the cluster than to any point not in the cluster.

3 well-separated clusters

Page 29: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 29

Types of Clusters: Center-Based

Center-based– A cluster is a set of objects such that an object in a cluster is

closer (more similar) to the “center” of a cluster, than to the center of any other cluster

– The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most “representative” point of a cluster

4 center-based clusters

Page 30: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 30

Types of Clusters: Contiguity-Based

Contiguous Cluster (Nearest neighbor or Transitive)– Each point is closer to at least one point in its cluster than to

any point in another cluster.

– Graph based clustering

This approach can have trouble when noise is present since a small bridge of points can merge two distinct clusters

8 contiguous clusters

Page 31: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 31

Types of Clusters: Density-Based

Density-based– A cluster is a dense region of points, which is separated by

low-density regions, from other regions of high density.

– Used when the clusters are irregular or intertwined, and when noise and outliers are present.

6 density-based clusters

Page 32: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 32

Types of Clusters: Objective Function

Clusters Defined by an Objective Function

– Finds clusters that minimize or maximize an objective

function.

– Enumerate all possible ways of dividing the points into

clusters and evaluate the `goodness' of each potential

set of clusters by using the given objective function.

(NP Hard)

– Can have global or local objectives.

Hierarchical clustering algorithms typically have local objectives

Partitional algorithms typically have global objectives

Page 33: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 33

Characteristics of the Input Data Are Important

Type of proximity or density measure– Central to clustering

– Depends on data and application

Data characteristics that affect proximity and/or density are

– Dimensionality Sparseness

– Attribute type

– Special relationships in the data For example, autocorrelation

– Distribution of the data

Noise and Outliers

– Often interfere with the operation of the clustering algorithm

Page 34: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Three fundamental clustering algorithms

Methods

Page 35: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 35

Clustering Algorithms

K-means and its variants

Hierarchical clustering

Density-based clustering

Page 36: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 36

K-means Clustering

Partitional clustering approach

Number of clusters, K, must be specified

Each cluster is associated with a centroid (center point)

Each point is assigned to the cluster with the closest centroid

The basic algorithm is very simple

Page 37: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Example of K-means Clustering

Page 38: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Example of K-means Clustering

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 6

Page 39: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 39

Example of K-means Clustering

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 6

Page 40: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 40

K-means Clustering – Details

Initial centroids are often chosen randomly.

– Clusters produced vary from one run to another.

The centroid is (typically) the mean of the points in the cluster.

‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.

K-means will converge for common similarity measures mentioned above.

Most of the convergence happens in the first few iterations.

– Often the stopping condition is changed to ‘Until relatively few points change clusters’

Complexity is O( n * K * I * d )

– n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

Page 41: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 41

Evaluating K-means Clusters

Most common measure is Sum of Squared Error (SSE)– For each point, the error is the distance to the nearest cluster

– To get SSE, we square these errors and sum them.

K

i Cx

i

i

xmdistSSE

1

2),(

x is a data point in cluster Ci and mi is the representative point for cluster Ci

• can show that mi corresponds to the center (mean) of the cluster

Given two sets of clusters, we prefer the one with the smallest error

One easy way to reduce SSE is to increase K, the number of clusters

A good clustering with smaller K can have a lower SSE than a poor clustering with higher K

Page 42: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 42

Two different K-means Clusterings

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Sub-optimal Clustering

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Optimal Clustering

Original Points

Page 43: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 43

Limitations of K-means

K-means has problems when clusters are of

differing

– Sizes

– Densities

– Non-globular shapes

K-means has problems when the data contains

outliers.

Page 44: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 44

Limitations of K-means: Differing Sizes

Original Points K-means (3 Clusters)

Page 45: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 45

Overcoming K-means Limitations

Original Points K-means Clusters

One solution is to use many clusters.

Find parts of clusters, but need to put together.

Page 46: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 46

Limitations of K-means: Differing Density

Original Points K-means (3 Clusters)

Page 47: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 47

Overcoming K-means Limitations

Original Points K-means Clusters

Page 48: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 48

Limitations of K-means: Non-globular Shapes

Original Points K-means (2 Clusters)

Page 49: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 49

Overcoming K-means Limitations

Original Points K-means Clusters

Page 50: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 50

Empty Clusters

K-means can yield empty clusters

6.5 9 10 15 16 18.5X X X

7.75 12.5 17.25

6.5 9 10 15 16 18.5X X X6.8 13 18

Empty

Cluster

Page 51: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 51

Handling Empty Clusters

Basic K-means algorithm can yield empty

clusters

Several strategies

Choose a point and assign it to the cluster

Choose the point that contributes most to SSE

Choose a point from the cluster with the highest SSE

If there are several empty clusters, the above can

be repeated several times.

Page 52: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 52

Pre-processing and Post-processing

Pre-processing

– Normalize the data

– Eliminate outliers

Post-processing

– Eliminate small clusters that may represent outliers

– Split ‘loose’ clusters, i.e., clusters with relatively high

SSE

– Merge clusters that are ‘close’ and that have relatively

low SSE

– Can use these steps during the clustering process

ISODATA

Page 53: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Importance of Choosing Initial Centroids

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 6

Page 54: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 54

Importance of Choosing Initial Centroids

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 5

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 6

Page 55: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

Importance of Choosing Initial Centroids …

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

yIteration 5

Page 56: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 56

Importance of Choosing Initial Centroids …

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 1

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 2

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 3

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

x

y

Iteration 4

-2 -1.5 -1 -0.5 0 0.5 1 1.5 2

0

0.5

1

1.5

2

2.5

3

xy

Iteration 5

Page 57: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 57

Problems with Selecting Initial Points

If there are K ‘real’ clusters then the chance of selecting one centroid from each cluster is small.

– Chance is relatively small when K is large

– If clusters are the same size, n, then

– For example, if K = 10, then probability = 10!/1010 = 0.00036

– Sometimes the initial centroids will readjust themselves in ‘right’ way, and sometimes they don’t

– Consider an example of five pairs of clusters

Page 58: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 58

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Starting with two initial centroids in one cluster of each pair of clusters

Page 59: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 59

10 Clusters Example

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Starting with two initial centroids in one cluster of each pair of clusters

Page 60: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 60

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while other

have only one.

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 4

Page 61: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 61

10 Clusters Example

Starting with some pairs of clusters having three initial centroids, while other have only one.

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

yIteration 1

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 2

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 3

0 5 10 15 20

-6

-4

-2

0

2

4

6

8

x

y

Iteration 4

Page 62: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 62

Solutions to Initial Centroids Problem

Multiple runs

– Helps, but probability is not on your side

Sample and use hierarchical clustering to determine initial centroids

Select more than k initial centroids and then select among these initial centroids

– Select most widely separated

Postprocessing

Generate a larger number of clusters and then perform a hierarchical clustering

Bisecting K-means

– Not as susceptible to initialization issues

Page 63: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 63

Updating Centers Incrementally

In the basic K-means algorithm, centroids are

updated after all points are assigned to a centroid

An alternative is to update the centroids after

each assignment (incremental approach)

– Each assignment updates zero or two centroids

– More expensive

– Introduces an order dependency

– Never get an empty cluster

– Can use “weights” to change the impact

Page 64: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 64

Bisecting K-means

Bisecting K-means algorithm– Variant of K-means that can produce a partitional or a

hierarchical clustering

CLUTO: http://glaros.dtc.umn.edu/gkhome/cluto/cluto/overview

Page 65: Data Mining Cluster Analysis: Basic Concepts and Algorithmsdidawiki.cli.di.unipi.it/lib/exe/fetch.php/dm/basic... · 2020. 10. 12. · MBNA -Corp-DOWN,Morgan-Stanley-DOWN Financial-DOWN

02/14/2018 Introduction to Data Mining, 2nd Edition 65

Bisecting K-means Example