
The K-Means Method in Cluster Analysis and Making It Intelligent
B.G. Mirkin
Professor, Department of Data Analysis and Artificial Intelligence, NRU HSE, Moscow, Russia
Professor Emeritus, School of Computer Science & Information Systems, Birkbeck College, University of London, UK

1

Outline:
- Clustering as empirical classification
- K-Means and its issues: (1) determining K and initialization; (2) weighting variables
- Addressing (1): data recovery clustering and K-Means (Mirkin 1987, 1990); one-by-one clustering: Anomalous Patterns and iK-Means; other approaches; computational experiment
- Addressing (2): three-stage K-Means; Minkowski K-Means; computational experiment
- Conclusion

2

WHAT IS CLUSTERING; WHAT IS DATA
K-MEANS CLUSTERING: Conventional K-Means; Initialization of K-Means; Intelligent K-Means; Mixed Data; Interpretation Aids
WARD HIERARCHICAL CLUSTERING: Agglomeration; Divisive Clustering with Ward Criterion; Extensions of Ward Clustering
DATA RECOVERY MODELS: Statistics Modelling as Data Recovery; Data Recovery Model for K-Means; for Ward; Extensions to Other Data Types; One-by-One Clustering
DIFFERENT CLUSTERING APPROACHES: Extensions of K-Means; Graph-Theoretic Approaches; Conceptual Description of Clusters
GENERAL ISSUES: Feature Selection and Extraction; Similarity on Subsets and Partitions; Validity and Reliability

3

Referred recent work:
B.G. Mirkin, M. Chiang (2010) Intelligent choice of the number of clusters in K-Means clustering: an experimental study with different cluster spreads, Journal of Classification, 27(1), 3-41.
B.G. Mirkin (2011) Choosing the number of clusters, WIREs Data Mining and Knowledge Discovery, 1(3), 252-260.
B.G. Mirkin, R. Amorim (2012) Minkowski metric, feature weighting and anomalous cluster initializing in K-Means clustering, Pattern Recognition, 45, 1061-1075.

4

What is clustering?

Finding homogeneous fragments, mostly sets of entities, in datasets for further analysis

5

Example: W. Jevons (1857) planet clusters (updated by Mirkin 1996)

Pluto does not fit into either of the two planet clusters: it originated a separate cluster (September 2006)

6

Example: A Few Clusters. Clustering interface to WEB search engines (Grouper). Query: Israel (after O. Zamir and O. Etzioni 2001)

Cluster 1 (24 sites): Society, religion. Israel and Judaism; Judaica collection.
Cluster 2 (12 sites): Middle East, War, History. The state of Israel; Arabs and Palestinians.
Cluster 3 (31 sites): Economy, Travel. Israel Hotel Association; Electronics in Israel.

7

Clustering algorithms:

Nearest neighbour Agglomerative clustering Divisive clustering Conceptual clustering

K-means Kohonen SOM Spectral clustering ………………….

8

Batch K-Means: a generic clustering method

Entities are presented as multidimensional points (*).
0. Put K hypothetical centroids (seeds).
1. Assign points to the centroids according to the minimum distance rule.
2. Put centroids in the gravity centres of the clusters thus obtained.
3. Iterate 1 and 2 until convergence.

K = 3 hypothetical centroids (@)

[scatter plot: points (*) and three centroids (@)]

9


K-Means: a generic clustering method

Entities are presented as multidimensional points (*).
0. Put K hypothetical centroids (seeds).
1. Assign points to the centroids according to the minimum distance rule.
2. Put centroids in the gravity centres of the clusters thus obtained.
3. Iterate 1 and 2 until convergence.
4. Output final centroids and clusters.

[scatter plot: final centroids (@) and their clusters]

12
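Below is a minimal Python sketch of these batch K-Means steps (NumPy only); the function name, the random seeding and the stopping test on unchanged assignments are illustrative choices, not taken from the slides.

```python
import numpy as np

def batch_kmeans(Y, K, seed=0, max_iter=100):
    """Batch K-Means: assign points to nearest centroids, then move
    centroids to the gravity centres of their clusters, until stable."""
    rng = np.random.default_rng(seed)
    # Step 0: K hypothetical centroids (seeds) chosen among the entities.
    centroids = Y[rng.choice(len(Y), size=K, replace=False)].astype(float)
    labels = np.full(len(Y), -1)
    for _ in range(max_iter):
        # Step 1: minimum-distance rule.
        dist2 = ((Y[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dist2.argmin(axis=1)
        # Step 3: stop when the assignments no longer change.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 2: centroids at the gravity centres of the clusters.
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = Y[labels == k].mean(axis=0)
    # Step 4: output final centroids and clusters.
    return centroids, labels
```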

K-Means criterion: Summary distance to cluster centroids

Minimize

[scatter plot: points (*) assigned to their nearest centroids (@)]

W(S, c) = \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{M} (y_{iv} - c_{kv})^2 = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)

13

Advantages of K-Means:
- Models typology building
- Simple "data recovery" criterion
- Computationally effective
- Can be utilised incrementally, "on-line"

Shortcomings of K-Means:
- Initialisation: no advice on K or initial centroids
- No deep minima
- No defence against irrelevant features

14

Initial Centroids: Correct

Two cluster case

15

Initial Centroids: Correct

Initial Final

16

Different Initial Centroids

17

Different Initial Centroids: Wrong

Initial Final

18

(1) To address:
- Number of clusters. Issue: the criterion always decreases with K (W_K < W_{K-1}), so it cannot be used directly to choose K.
- Initial setting
- Deeper minimum

The two are interrelated: a good initial setting leads to a deeper minimum.

19

Number K: the conventional approach
Take a range R_K of values of K, say K = 3, 4, ..., 15.
For each K in R_K, run K-Means 100-200 times from randomly chosen initial centroids and take the best of the runs, W(S, c) = W_K.
Compare W_K for all K in R_K in a special way and choose the best, for example by:
- Gap statistic (2001)
- Jump statistic (2003)
- Hartigan (1975): in the ascending order of K, pick the first K at which H_K = [W_K / W_{K+1} - 1](N - K - 1) <= 10

20
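A sketch of this conventional procedure in Python, using scikit-learn's KMeans for the repeated runs (its inertia_ attribute is W_K) and the reconstructed multiplicative form of Hartigan's rule with the threshold of 10 quoted above; the function name and the fallback when no K passes the threshold are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def choose_k_hartigan(Y, k_range=range(3, 16), n_init=100, threshold=10.0):
    """Run K-Means many times for each K, keep the best W_K (inertia),
    then pick the first K with Hartigan's index H_K <= threshold."""
    N = len(Y)
    W = {}
    for K in k_range:
        # Best of n_init random initialisations; .inertia_ is W(S, c) = W_K.
        W[K] = KMeans(n_clusters=K, n_init=n_init, random_state=0).fit(Y).inertia_
    for K in list(k_range)[:-1]:
        H = (W[K] / W[K + 1] - 1.0) * (N - K - 1)
        if H <= threshold:
            return K
    return max(k_range)  # fall back to the largest K tried
```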

(1) Addressing:
- Number of clusters
- Initial setting
with a PCA-like method in the data recovery approach

21

Representing a partition

Cluster k:

Centroid: c_{kv} (v: feature)
Binary 1/0 membership: z_{ik} (i: entity)

22

Basic equations (same as for PCA, but score vectors zk constrained to be binary)

y – data entry, z – 1/0 membership, not score

c - cluster centroid, N – cardinality

i - entity, v - feature /category, k - cluster

y_{iv} = \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv}

23

Quadratic data scatter decomposition (Pythagorean)

K-Means: alternating least-squares minimisation.
y: data entry, z: 1/0 membership (not a score); c: cluster centroid, N_k: cluster cardinality; i: entity, v: feature/category, k: cluster.

\sum_{i=1}^{N} \sum_{v=1}^{V} y_{iv}^2 = \sum_{k=1}^{K} \sum_{v=1}^{V} N_k c_{kv}^2 + \sum_{k=1}^{K} \sum_{i \in S_k} \sum_{v=1}^{V} (y_{iv} - c_{kv})^2,

where y_{iv} = \sum_{k=1}^{K} c_{kv} z_{ik} + e_{iv}.

24
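A quick numeric check of this Pythagorean decomposition on random data; the partition is arbitrary, since the identity holds for any partition as long as each centroid is the gravity centre (mean) of its cluster.

```python
import numpy as np

rng = np.random.default_rng(1)
Y = rng.normal(size=(200, 5))          # data matrix, entities x features
labels = rng.integers(0, 3, size=200)  # an arbitrary partition into K = 3 clusters

total_scatter = (Y ** 2).sum()
explained = 0.0    # sum_k sum_v N_k * c_kv^2
unexplained = 0.0  # sum_k sum_{i in S_k} sum_v (y_iv - c_kv)^2
for k in range(3):
    Yk = Y[labels == k]
    ck = Yk.mean(axis=0)
    explained += len(Yk) * (ck ** 2).sum()
    unexplained += ((Yk - ck) ** 2).sum()

# The two sides agree up to floating-point error.
assert np.isclose(total_scatter, explained + unexplained)
```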

Equivalent criteria (1)
A. Bilinear residuals squared, MIN: minimizing the difference between the data and the cluster structure.
\sum_{i=1}^{N} \sum_{v \in V} e_{iv}^2 \to \min
B. Distance-to-centre squared, MIN: minimizing the difference between the data and the cluster structure.
W = \sum_{k=1}^{K} \sum_{i \in S_k} d(i, c_k) \to \min

25

Equivalent criteria (2)
C. Within-group error squared, MIN: minimizing the difference between the data and the cluster structure.
\sum_{k=1}^{K} \sum_{v \in V} \sum_{i \in S_k} (c_{kv} - y_{iv})^2 \to \min
D. Within-group variance squared, MIN: minimizing within-cluster variance.
\sum_{k=1}^{K} |S_k|\, \sigma_k^2(S_k) \to \min

26

Equivalent criteria (3)
E. Semi-averaged within distance squared, MIN: minimizing dissimilarities within clusters (d: squared Euclidean distance).
\sum_{k=1}^{K} \frac{1}{2|S_k|} \sum_{i,j \in S_k} d(i, j) \to \min
F. Semi-averaged within similarity, MAX: maximizing similarities within clusters.
\sum_{k=1}^{K} \frac{1}{|S_k|} \sum_{i,j \in S_k} a(i, j) \to \max, where a(i, j) = \langle y_i, y_j \rangle

27
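The equivalence of criteria B and E rests on a standard identity for squared Euclidean distances: the summary distance of a cluster's entities to their centroid equals the sum of d(i, j) over all ordered pairs within the cluster, divided by 2|S_k| (the factor 2 compensates for every pair being counted twice). A small numeric check:

```python
import numpy as np

rng = np.random.default_rng(2)
Yk = rng.normal(size=(50, 4))   # entities of one cluster S_k
ck = Yk.mean(axis=0)            # its centroid

# Criterion B contribution: summary squared distance to the centroid.
to_centroid = ((Yk - ck) ** 2).sum()

# Criterion E contribution: squared distances over all ordered pairs (i, j),
# semi-averaged, i.e. divided by 2|S_k|.
diff = Yk[:, None, :] - Yk[None, :, :]
pairwise = (diff ** 2).sum(axis=2)          # |S_k| x |S_k| matrix of d(i, j)
semi_averaged = pairwise.sum() / (2 * len(Yk))

assert np.isclose(to_centroid, semi_averaged)
```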

Equivalent criteria (4)
G. Distant centroids, MAX: finding anomalous types.
\sum_{k=1}^{K} \sum_{v \in V} c_{kv}^2 |S_k| \to \max
H. Consensus partition, MAX: maximizing correlation between the sought partition and the given variables, i.e. maximizing the average, over v = 1, ..., V, of an association measure between partition S and variable v.

28

Equivalent criteria (5)
I. Spectral clusters, MAX: maximizing the summary Rayleigh quotient over binary membership vectors.
\sum_{k=1}^{K} \frac{z_k^T Y Y^T z_k}{z_k^T z_k} \to \max

29

PCA inspired Anomalous Pattern Clustering

y_{iv} = c_v z_i + e_{iv},
where z_i = 1 if i \in S and z_i = 0 if i \notin S

With Euclidean distance squared

\sum_{i=1}^{N} \sum_{v=1}^{V} y_{iv}^2 = N_S \sum_{v=1}^{V} c_{Sv}^2 + \sum_{i \in S} \sum_{v=1}^{V} (y_{iv} - c_{Sv})^2 + \sum_{i \notin S} \sum_{v=1}^{V} y_{iv}^2

or, in distance terms,

\sum_{i=1}^{N} d(i, 0) = N_S\, d(c_S, 0) + \sum_{i \in S} d(i, c_S) + \sum_{i \notin S} d(i, 0)

c_S must be anomalous, that is, interesting

30

Initial setting with Anomalous Pattern Cluster

Tom Sawyer

31

Anomalous Pattern Clusters: Iterate

[figure: the Tom Sawyer illustration with anomalous pattern clusters 1, 2, 3 extracted one by one; 0 marks the reference point]

32

iK-Means: Anomalous clusters + K-Means. After extracting 2 clusters (how can one know that 2 is right?)

Final

33

iK-Means: Defining K and the Initial Setting with Iterative Anomalous Pattern Clustering

1. Find all Anomalous Pattern clusters.
2. Remove the smaller (e.g., singleton) clusters.
3. Take the number of remaining clusters as K and initialise K-Means with their centres.
(A sketch of this procedure is given below.)

34
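A compact Python sketch of Anomalous Pattern extraction and the iK-Means initialisation it feeds, assuming the data have already been standardised so that the reference point is the origin 0; the helper names and the minimum cluster size are illustrative, not from the slides.

```python
import numpy as np

def anomalous_pattern(Y, active):
    """Extract one Anomalous Pattern cluster from the rows of Y listed in `active`:
    start from the entity farthest from the reference point 0, then alternate
    the minimum-distance rule (0 versus the tentative centre) with recentring."""
    idx = np.array(active)
    seed = idx[np.argmax((Y[idx] ** 2).sum(axis=1))]   # farthest entity from 0
    c = Y[seed].astype(float)
    while True:
        closer = ((Y[idx] - c) ** 2).sum(axis=1) < (Y[idx] ** 2).sum(axis=1)
        members = idx[closer]
        if members.size == 0:        # degenerate case: keep just the seed entity
            return [int(seed)], c
        new_c = Y[members].mean(axis=0)
        if np.allclose(new_c, c):
            return [int(i) for i in members], new_c
        c = new_c

def ik_means_init(Y, min_size=2):
    """iK-Means initialisation: extract AP clusters one by one, discard the
    small ones, and return K and the surviving centres for K-Means."""
    remaining = list(range(len(Y)))
    centres = []
    while remaining:
        members, c = anomalous_pattern(Y, remaining)
        if len(members) >= min_size:
            centres.append(c)
        member_set = set(members)
        remaining = [i for i in remaining if i not in member_set]
    return len(centres), np.array(centres)
```

K-Means is then run from the returned centres, for example with scikit-learn's KMeans(n_clusters=K, init=centres, n_init=1).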

Study of eight Number-of-clusters methods (joint work with Mark Chiang):

- Variance based: Hartigan (HK), Calinski & Harabasz (CH), Jump Statistic (JS)
- Structure based: Silhouette Width (SW)
- Consensus based: Consensus Distribution area (CD), Consensus Distribution mean (DD)
- Sequential extraction of APs (iK-Means): Least Squares (LS), Least Moduli (LM)

35

Experimental results for 9 Gaussian clusters (3 spread patterns), data size 1000 x 15.

The table compares the eight methods (HK, CH, JS, SW, CD, DD, LS, LM) on the estimated number of clusters and on the Adjusted Rand Index, each under the large-spread and small-spread conditions; the best performers are marked as 1-time, 2-times or 3-times winners, with two winners counted each time.

36

37

(2) To address: weighting features according to relevance.

W(S, c, w) = \sum_{k=1}^{K} \sum_{i \in I} \sum_{v=1}^{M} s_{ik}\, w_v^{\beta} \lvert y_{iv} - c_{kv} \rvert^{\beta} = \sum_{k=1}^{K} \sum_{i \in S_k} d(y_i, c_k)

w: feature weights = scale factors; s_{ik}: 1/0 cluster membership.

3-step K-Means (a sketch of the weight step follows below):
- Given s, c, find w (the weights)
- Given w, c, find s (the clusters)
- Given s, w, find c (the centroids)
- Iterate till convergence

38
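A sketch of the first of the three steps, the weight update for given clusters and centroids; it uses the standard closed-form minimiser of the weighted criterion under the constraint that the weights sum to one (valid for beta > 1), with illustrative function and variable names.

```python
import numpy as np

def update_weights(Y, labels, centroids, beta=2.0):
    """Given clusters (labels) and centroids, find feature weights w_v that
    minimise sum_k sum_{i in S_k} sum_v w_v^beta |y_iv - c_kv|^beta
    subject to sum_v w_v = 1 (beta > 1)."""
    # Per-feature dispersion D_v around the cluster centroids.
    D = np.zeros(Y.shape[1])
    for k, c in enumerate(centroids):
        D += (np.abs(Y[labels == k] - c) ** beta).sum(axis=0)
    D = np.maximum(D, 1e-12)             # guard against zero dispersion
    # Closed-form minimiser: w_v proportional to D_v^(-1/(beta-1)).
    w = D ** (-1.0 / (beta - 1.0))
    return w / w.sum()
```

The other two steps are the familiar ones: reassign entities by the weighted Minkowski distance (given w and c) and recompute the centroids (given s and w); the three steps are alternated until convergence.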

Minkowski's centers: minimize over c

d_\beta(c) = \sum_{i \in S_k} \lvert y_{iv} - c \rvert^{\beta}

At \beta > 1, d_\beta(c) is convex: a gradient method applies.

39
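A sketch of computing the Minkowski beta-centre of one feature within a cluster by direct minimisation of d_beta(c); SciPy's bounded scalar minimiser stands in here for the gradient method mentioned on the slide.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def minkowski_center(y, beta):
    """Minkowski beta-centre of a 1-D sample y: argmin_c sum_i |y_i - c|^beta.
    For beta > 1 the objective is convex, so a bounded scalar search suffices."""
    d = lambda c: np.sum(np.abs(y - c) ** beta)
    res = minimize_scalar(d, bounds=(y.min(), y.max()), method="bounded")
    return res.x

# Sanity checks: beta = 2 gives the mean; small beta pulls towards the median.
y = np.array([0.0, 1.0, 2.0, 10.0])
print(minkowski_center(y, 2.0))   # close to y.mean() = 3.25
print(minkowski_center(y, 1.1))   # pulled towards the median
```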

Minkowski's metric effects:
- The more uniform the distribution of the entities over a feature, the smaller its weight; for a uniform distribution, w = 0.
- The best Minkowski power \beta is data dependent.
- The best \beta can be learnt from the data in a semi-supervised manner (with clustering of all objects).
- Example: on Fisher's Iris data, iMWK-Means makes only 5 errors (a record).

40

Conclusion: the data-recovery, K-Means-wise model of clustering is a tool that involves a wealth of interesting criteria for mathematical investigation and application projects.

Further work:
- Extending the approach to other data types: text, sequence, image, web page.
- Upgrading K-Means to address the issue of interpretation of the results.

[diagram: Data -> Coder (data clustering) -> Model (clusters) -> Decoder (data recovery)]

41

HEFCE survey of students’ satisfaction

HEFCE method (ALL): 93 share the highest mark. STRATA: 43 best, ranging from 71.8 to 84.6.

42