Cluster Analysis (Clustering)
©Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science
Simon Fraser University, Canada
http://www.cs.sfu.ca
Chapter 7. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
General Applications of Clustering
Business/economic science (especially marketing research)
Biomedical informatics
Pattern recognition
Spatial data analysis
   detect spatial clusters and explain them in spatial data mining
WWW
   document profiling
   clustering log data to discover groups of similar access patterns

Domain knowledge is very important to the applications of clustering!
Specific Examples of Clustering Applications
Marketing: Help marketers discover distinct group profiles of customers, and then use this knowledge to develop targeted marketing programs
Insurance: Identifying distinct group profiles of motor insurance policy holders with a high average claim cost
Biology: Derive plant and animal taxonomies, categorize genes with similar functionality, and gain insight into the structures inherent in populations
What Is Good Clustering?
A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity.
The quality of a clustering result depends on both the similarity measure and the clustering method.
The quality of a clustering method is measured by its ability to discover some or all of the hidden patterns.
Issues of Clustering in Data Mining
Scalability (important for big data analysis)
Ability to deal with different types of attributes
Discovery of clusters with arbitrary shape
Minimal requirements for domain knowledge to determine input parameters
Ability to deal with noise and outliers
Insensitivity to the order of input records
Ability to deal with high dimensionality
Incorporation of user-specified constraints
Interpretability and usability
Chapter 7. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
Data Structures in Clustering
Two data structures:

Data matrix (n objects × p variables):

   X = [ x_11  ...  x_1f  ...  x_1p
         ...
         x_i1  ...  x_if  ...  x_ip
         ...
         x_n1  ...  x_nf  ...  x_np ]

Dissimilarity matrix (n objects × n objects):

   D = [ 0
         d(2,1)   0
         d(3,1)   d(3,2)   0
         :        :        :
         d(n,1)   d(n,2)   ...   0 ]
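For illustration, a dissimilarity matrix can be computed from a data matrix in a few lines. A minimal Python sketch using SciPy (the sample points are invented):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Data matrix: n = 4 objects x p = 2 variables (made-up sample points)
X = np.array([[1.0, 2.0],
              [2.0, 4.0],
              [8.0, 7.0],
              [9.0, 6.0]])

# Condensed pairwise distances, then the full n x n dissimilarity matrix.
# The result is symmetric with a zero diagonal, so only the lower triangle
# carries information, as in the schematic above.
D = squareform(pdist(X, metric='euclidean'))
print(np.round(D, 2))
```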
Similarity Measurement between Samples
Dissimilarity/similarity metric: similarity is expressed in terms of a distance function d(i, j), which is typically a metric.
The definitions of distance functions are usually very different for interval-scaled (numeric), boolean, categorical, ordinal and ratio variables.
It is hard to define "similar enough" or "good enough"; the answer is typically highly subjective.
Sometimes, weights should be associated with different variables based on applications and data semantics.
Type of data in clustering analysis
Interval-scaled variables: the data have order and equal intervals, measured on a linear scale; the intervals keep the same importance throughout the scale
Binary variables
Nominal, ordinal, and ratio variables
Variables of mixed types
Data Normalization (Interval-Scaled Variables)

Data normalization (for each feature f):

1. Calculate the mean:

   m_f = (1/n)(x_1f + x_2f + ... + x_nf)

2. Calculate the mean absolute deviation:

   s_f = (1/n)(|x_1f - m_f| + |x_2f - m_f| + ... + |x_nf - m_f|)

3. Calculate the standardized measurement (z-score, zero-mean normalization):

   z_if = (x_if - m_f) / s_f

Using the mean absolute deviation s_f is more robust to outliers, for
cluster analysis, than using the standard deviation

   s'_f = sqrt( (1/n)((x_1f - m_f)^2 + (x_2f - m_f)^2 + ... + (x_nf - m_f)^2) )
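A minimal Python sketch of these three steps (the feature values are invented, with 100.0 playing the outlier):

```python
import numpy as np

def standardize(column):
    """Zero-mean normalization using the mean absolute deviation s_f,
    which is more robust to outliers than the standard deviation."""
    m_f = column.mean()                    # step 1: the mean
    s_f = np.abs(column - m_f).mean()      # step 2: mean absolute deviation
    return (column - m_f) / s_f            # step 3: the z-scores

x_f = np.array([1.0, 2.0, 3.0, 4.0, 100.0])
print(standardize(x_f))
```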
Similarity and Dissimilarity Between Data Objects
Distances are normally used to measure the similarity or dissimilarity between two data objects x_i and x_j (if the data objects are viewed as points in a data space).
Some popular ones include the Minkowski distance:

   d(x_i, x_j) = ( |x_i1 - x_j1|^q + |x_i2 - x_j2|^q + ... + |x_ip - x_jp|^q )^(1/q)

where x_i = <x_i1, x_i2, ..., x_ip> and x_j = <x_j1, x_j2, ..., x_jp> are two
p-dimensional data points, and q is a positive integer.

If q = 1, d is the Manhattan distance:

   d(x_i, x_j) = |x_i1 - x_j1| + |x_i2 - x_j2| + ... + |x_ip - x_jp|
Similarity and Dissimilarity Between Objects (Cont.)
If q = 2, d is the Euclidean distance:

   d(x_i, x_j) = sqrt( |x_i1 - x_j1|^2 + |x_i2 - x_j2|^2 + ... + |x_ip - x_jp|^2 )

Properties:
   d(i,j) >= 0
   d(i,i) = 0
   d(i,j) = d(j,i)
   d(i,j) <= d(i,k) + d(k,j)

One can also use a weighted distance, the parametric Pearson product-moment correlation (or simply correlation r), or other dissimilarity measures.
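A small sketch of the Minkowski family (the two points are invented):

```python
import numpy as np

def minkowski(xi, xj, q):
    """Minkowski distance of order q: q = 1 is Manhattan, q = 2 is Euclidean."""
    return np.sum(np.abs(xi - xj) ** q) ** (1.0 / q)

xi = np.array([1.0, 2.0, 3.0])
xj = np.array([4.0, 6.0, 3.0])
print(minkowski(xi, xj, 1))   # Manhattan: 3 + 4 + 0 = 7
print(minkowski(xi, xj, 2))   # Euclidean: sqrt(9 + 16 + 0) = 5
```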
Distance Functions for Binary Variables
A contingency table for binary data:

                     object j
                    1      0     sum
   object i   1     a      b     a+b
              0     c      d     c+d
            sum    a+c    b+d     p

a : the number of variables that equal 1 for both objects i and j
b : the number of variables that equal 1 for object i but 0 for object j
c : the number of variables that equal 0 for object i but 1 for object j
d : the number of variables that equal 0 for both objects i and j
Using the contingency table above:

Simple matching coefficient (if the binary variable is symmetric):

   d(i,j) = (b + c) / (a + b + c + d)

Jaccard coefficient (if the binary variable is asymmetric):

   d(i,j) = (b + c) / (a + b + c)

   sim(i,j) = 1 - d(i,j) = a / (a + b + c)
Dissimilarity between Binary Variables
Example:

   Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
   Jack     M        Y       N       P        N        N        N
   Mary     F        Y       N       P        N        P        N
   Jim      M        Y       P       N        N        N        N

Gender is a symmetric attribute and is ignored here. The remaining attributes are asymmetric binary. Let the values Y and P be set to 1, and the value N be set to 0.
Dissimilarity between Binary Variables (Asymmetric Example)

Preprocessing: ignore Gender; Y, P → 1; N → 0

   Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
   Jack     1       0       1        0        0        0
   Mary     1       0       1        0        1        0
   Jim      1       1       0        0        0        0

Using the Jaccard coefficient d(i,j) = (b + c) / (a + b + c):

   d(jack, mary) = (0 + 1) / (2 + 0 + 1) = 0.33
   d(jack, jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
   d(jim, mary)  = (1 + 2) / (1 + 1 + 2) = 0.75

Conclusion: the clinical conditions of Jack and Mary are the most similar.
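The numbers above can be reproduced with a few lines of Python; a sketch:

```python
def jaccard_dissimilarity(i, j):
    """d(i,j) = (b + c) / (a + b + c) for asymmetric binary vectors,
    ignoring the negative matches d."""
    a = sum(1 for x, y in zip(i, j) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(i, j) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(i, j) if x == 0 and y == 1)
    return (b + c) / (a + b + c)

#       Fever, Cough, Test-1, Test-2, Test-3, Test-4
jack = [1, 0, 1, 0, 0, 0]
mary = [1, 0, 1, 0, 1, 0]
jim  = [1, 1, 0, 0, 0, 0]

print(round(jaccard_dissimilarity(jack, mary), 2))  # 0.33
print(round(jaccard_dissimilarity(jack, jim), 2))   # 0.67
print(round(jaccard_dissimilarity(jim, mary), 2))   # 0.75
```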
Nominal Variables
A generalization of the binary variable in that it can take more than 2 states, e.g., color: red, yellow, blue, …
Dissimilarity between nominal variables:

Method 1: simple matching

   d(i,j) = (p - m) / p

   m : # of matches, p : total # of variables

Method 2: use a large number of asymmetric binary variables, creating a new binary variable for each of the M nominal states of a particular variable, e.g., <0, 0, 0, 0, 0, 0, 1> for the color purple

Method 3: use a small number of additive binary variables, e.g., <1, 0, 1> for purple (in a 3-bit RGB representation)
Dissimilarity between Ordinal Variables
An ordinal variable can be discrete or continuous. The values are ordered in a meaningful sequence, e.g., job ranks. It can be treated like an interval-scaled variable (for each feature f):

1. Define ranks for the feature values: a feature with M_f states maps to
   ranks r_if ∈ {1, ..., M_f}

2. Replace x_if (the value of the f-th variable of the i-th object) by its
   rank r_if

3. Map the range of feature f onto [0, 1] by replacing the rank with

   z_if = (r_if - 1) / (M_f - 1)

Then compute the dissimilarity using the methods for interval-scaled variables.
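A minimal sketch of the rank mapping (the job-rank state names are hypothetical):

```python
def ordinal_to_interval(value, states):
    """Map an ordinal value to [0, 1] via z_if = (r_if - 1) / (M_f - 1)."""
    r_if = states.index(value) + 1   # rank in {1, ..., M_f}
    M_f = len(states)
    return (r_if - 1) / (M_f - 1)

job_ranks = ['assistant', 'associate', 'full']   # hypothetical ordered states
print(ordinal_to_interval('assistant', job_ranks))  # 0.0
print(ordinal_to_interval('associate', job_ranks))  # 0.5
print(ordinal_to_interval('full', job_ranks))       # 1.0
```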
Dissimilarity between Ratio-Scaled Variables
Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately exponential, such as A e^(Bt) or A e^(-Bt)
Alternative methods:
1. treat them like interval-scaled variables — not a good choice! (why? Since the scale may be distorted)
2. apply a logarithmic transformation y_if = log(x_if), and treat the transformed value as interval-scaled
3. treat them as continuous ordinal data and treat their ranks as interval-scaled.
Variables of Mixed Types
A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval-scaled, and ratio-scaled. One may use a weighted formula to combine their effects (distance between objects i and j):

   d(i,j) = ( Σ_{f=1}^{p} w_ij^(f) d_ij^(f) ) / ( Σ_{f=1}^{p} w_ij^(f) )

If feature f is binary or nominal:
   d_ij^(f) = 0 if x_if = x_jf, and d_ij^(f) = 1 otherwise

If feature f is interval-based: use the normalized distance

If feature f is ordinal or ratio-scaled: compute the ranks r_if, map them to

   z_if = (r_if - 1) / (M_f - 1)

and treat z_if as interval-scaled
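A rough sketch of the combined measure, under the simplifying assumptions noted in the comments (unit weights, features already scaled to [0, 1]; the sample objects are invented):

```python
def mixed_dissimilarity(i, j, types):
    """Sketch of the weighted formula with unit weights w_ij^(f) = 1,
    except w = 0 for asymmetric-binary negative matches; assumes interval
    and ordinal features are already scaled to [0, 1]."""
    num, den = 0.0, 0.0
    for x, y, t in zip(i, j, types):
        if t == 'asymmetric' and x == 0 and y == 0:
            continue                       # w = 0: negative match is ignored
        if t in ('binary', 'asymmetric', 'nominal'):
            d = 0.0 if x == y else 1.0
        else:                              # 'interval' or 'ordinal' (scaled)
            d = abs(x - y)
        num += d                           # w * d with w = 1
        den += 1.0
    return num / den

# Hypothetical objects: (asymmetric binary, nominal, scaled interval)
i = [1, 'red', 0.2]
j = [1, 'blue', 0.6]
print(mixed_dissimilarity(i, j, ['asymmetric', 'nominal', 'interval']))  # ~0.47
```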
Chapter 7. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
Major Clustering Approaches
Partitioning algorithms: Construct a number of partitions and
then improve them iteratively by some optimization criterion
Hierarchy algorithms: Create a hierarchical structure for the
data set using some optimization criterion
Density-based: Based on connectivity and/or density functions
Grid-based: Quantize the data space into a finite number of grid
cells on which cluster analysis is performed
Model-based: a cluster model is hypothesized for each cluster, and the best fit of the data to the given cluster models is found
Chapter 7. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
Partitioning Algorithms: Basic Concept
(Cluster Representation)

Partitioning method: given a database D of n objects, a partitioning algorithm organizes the objects into k partitions, each representing a cluster.

Given k, find the k clusters that optimize the chosen partitioning criterion (based on a similarity function):
   Global optimal solution: exhaustively enumerate all partitions
   Heuristic methods: the k-means and k-medoids algorithms
   k-means (MacQueen'67): each cluster is represented by the center of the cluster during the iterative clustering process
   k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw'87): each cluster is represented by a representative object of the cluster during the iterative clustering process
Centroid-Based Clustering Technique
(The k-Means Clustering Method )
Given k, the k-Means algorithm is implemented in 4 steps:
1. Randomly select k of the data objects as the seed points. Each seed point represents a cluster.
2. Re-assign each data object to a cluster with the nearest seed point. (n objects, k clusters)
3. Compute new seed points which are the centroids of the clusters at this epoch. The centroid is the center (mean) of a cluster.
4. Go back to Step 2; stop when no reassignment occurs.
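A compact sketch of these four steps in Python (it assumes no cluster ever becomes empty; the two Gaussian blobs are synthetic):

```python
import numpy as np

def k_means(X, k, n_iter=100, seed=0):
    """A minimal k-means sketch following the four steps above.
    Assumes no cluster ever becomes empty."""
    rng = np.random.default_rng(seed)
    seeds = X[rng.choice(len(X), k, replace=False)]    # step 1: random seeds
    for _ in range(n_iter):
        # step 2: assign each object to the cluster of the nearest seed
        dists = np.linalg.norm(X[:, None, :] - seeds[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # step 3: new seeds are the centroids (means) of the clusters
        new_seeds = np.array([X[labels == c].mean(axis=0) for c in range(k)])
        if np.allclose(new_seeds, seeds):              # step 4: stop if stable
            break
        seeds = new_seeds
    return labels, seeds

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(6, 1, (20, 2))])
labels, centroids = k_means(X, k=2)
```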
The k-Means Clustering Method
Example: 2-means

[Figure: four scatter plots on a 10 × 10 grid showing the 2-means iteration cycle: pick seeds and assign → find the centroids → reassign all data → find the centroids → stop if stabilized]
Comments on the k-Means Method
Strength:
   Relatively efficient: O(nkt), where n is # objects, k is # clusters, and t is # iterations; normally k, t << n
   Often terminates at a local optimum; the global optimum may be found using techniques such as simulated annealing and genetic algorithms

Weakness:
   Applicable only when the mean is defined; what about categorical data, like the colors red, blue, green, ...?
   Need to specify k, the number of clusters, in advance
   Sensitive to noisy data and outliers, which may distort the distribution of the data
   Not suitable for discovering clusters with non-convex shapes
Variations of the k-Means Method
A few variants of k-means differ in:
   Selection of the k initial seed points
   Dissimilarity calculations
   Strategies to calculate cluster means

Handling categorical data: k-modes (Huang'98)
   Uses new dissimilarity measures to deal with categorical objects
   Replaces the means of clusters with modes
   Uses a frequency-based method to update the modes of clusters
   A mixture of categorical and numerical data: the k-prototype method
Representative Object-Based Technique
(The K-Medoids Clustering Method)
Find representative objects, called medoids, in clusters.
   Medoid: the most centrally located data object of a cluster
   (k-means is centroid-based and is sensitive to noisy data and outliers)

PAM (Partitioning Around Medoids, 1987)
   PAM works effectively for small data sets, but does not scale well to large data sets: O(C(n,k))

Improved algorithms:
   CLARA (Kaufmann & Rousseeuw, 1990): use sampling to find representatives of the data (in order to reduce the data size)
   CLARANS (Ng & Han, 1994): use different samples for the representative search at each run; return the best representative set found in a user-defined number of runs
PAM (Partitioning Around Medoids) (1987)
PAM (Kaufman and Rousseeuw, 1987), built into S+. It uses representative objects of a cluster as seed points:

1. Select k representative objects (medoids) arbitrarily.

2. Assign each non-medoid object to the most similar representative object (n objects, k clusters).

3. Do a greedy search, O(k(n-k)^2): for each pair of non-medoid object h and medoid object i, calculate the total swapping cost TC_ih (after cluster reassignment), and find the best cost TC_ihBest.

   If TC_ihBest = (D_TBest - D_T) < 0, where

      D_T = Σ_{j=1}^{k} Σ_{p ∈ C_j} |p - o_j|

   (o_j is the medoid of cluster C_j), then medoid object i is replaced by h; otherwise stop.

4. Repeat step 3 until there is no change.
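A minimal sketch of this swap search over a precomputed distance matrix; it accepts any improving swap rather than only the single best TC_ih per pass, which reaches the same kind of local optimum:

```python
import numpy as np

def total_cost(D, medoids):
    """D_T: the sum over all objects of the distance to their nearest medoid."""
    return D[:, medoids].min(axis=1).sum()

def pam(D, k, seed=0):
    """A sketch of PAM over a precomputed n x n distance matrix D."""
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(D), k, replace=False))   # step 1
    best = total_cost(D, medoids)
    improved = True
    while improved:                                        # step 4: repeat
        improved = False
        for mi in range(k):                                # step 3: try swaps
            for h in range(len(D)):
                if h in medoids:
                    continue
                trial = medoids[:mi] + [h] + medoids[mi + 1:]
                cost = total_cost(D, trial)
                if cost - best < 0:                        # TC_ih < 0: accept
                    best, medoids, improved = cost, trial, True
    labels = D[:, medoids].argmin(axis=1)                  # step 2: assignment
    return medoids, labels

# Usage idea: D = squareform(pdist(X)) from scipy.spatial.distance, then
# medoids, labels = pam(D, k=2)
```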
CLARA (Clustering Large Applications) (1990)
CLARA (Kaufmann and Rousseeuw, 1990)

Built into statistical analysis packages, such as S+. It draws sample sets of the data, applies PAM to the samples, and gives the best clustering as the output.
Strength: deals with larger data sets than PAM

Weaknesses:
   Efficiency depends on the sample size
   If the samples are biased, a good clustering based on the samples will not necessarily represent a good clustering of the whole data set
CLARANS (“Randomized” CLARA) (1994)
(Clustering Large Applications based on RANdomized Search, Ng & Han'94)

The PAM clustering process can be viewed as searching a graph where every node (a set of k medoids) is a potential solution: O(C(n,k)) nodes. At each step, all neighbors of the current node are examined, and the current node is replaced by the neighbor with the best improvement. (Two nodes are neighbors if their medoid sets differ by only one object; every node has k(n-k) neighbors.)

CLARA confines the search to a fixed sample of size m at each search stage. CLARANS instead dynamically draws a random sample of neighbors in each search step. If a local optimum is found, CLARANS starts with a new sample of neighbors in search of a new local optimum. Once a user-specified number of local minima has been found, CLARANS outputs the best one.

It is more efficient and scalable than both PAM and CLARA.
Chapter 7. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
Hierarchical Clustering
Uses a distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but may need a termination condition.
[Figure: five objects a, b, c, d, e merged step by step into {a,b}, {d,e}, {c,d,e}, {a,b,c,d,e}; read left to right (steps 0-4) for agglomerative clustering (AGNES), and right to left (steps 4-0) for divisive clustering (DIANA)]
AGNES (Agglomerative Nesting)
Introduced in Kaufmann and Rousseeuw (1990); implemented in statistical analysis packages, e.g., S+.

Uses the single-link method and the dissimilarity matrix:
   Merge the nodes (clusters) that have the least dissimilarity
   Go on in a non-descending fashion
   Eventually all nodes belong to the same cluster

[Figure: three scatter plots showing clusters being merged step by step]

AGNES organizes the data objects into a tree of hierarchical partitions (a tree of clusters), called a dendrogram.

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.
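Single-link agglomerative clustering and dendrogram cutting are available in SciPy; a sketch on invented points:

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

# Invented 2-D points; method='single' gives single-link (AGNES-style) merging
X = np.array([[1.0, 1.0], [1.5, 1.0], [2.0, 1.2],
              [7.0, 7.0], [7.5, 7.2], [8.0, 7.0]])
Z = linkage(X, method='single')    # each row of Z records one merge step

# Cut the dendrogram at the level where two clusters remain
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)                      # e.g. [1 1 1 2 2 2]

# dendrogram(Z) draws the merge tree (requires matplotlib)
```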
Dendrogram (Shows How Clusters Merge Hierarchically)
Clustering Dendrogram in Gene Expression Analysis
[Figure: a clustering dendrogram used for finding differentially regulated genes]
DIANA (Divisive Analysis)
Introduced in Kaufmann and Rousseeuw (1990); implemented in statistical analysis packages, e.g., S+.

Inverse order of AGNES: start with all objects in one cluster and split; eventually each node (data object) forms a cluster on its own.

[Figure: three scatter plots showing one cluster being split step by step]
More on Hierarchical Clustering Methods

Weaknesses of agglomerative clustering methods:
   They do not scale well: time complexity of at least O(n^2), where n is the number of data objects
   They can never undo what was done at a previous step

Integration of hierarchical and distance-based clustering:
   BIRCH (1996): uses a CF-tree and incrementally adjusts the quality of sub-clusters (good for big data analysis)
   CURE (1998): selects well-scattered points from each cluster and then shrinks them towards the center of the cluster by a specified fraction
   CHAMELEON (1999): hierarchical clustering using dynamic modeling
BIRCH (1996)
Birch: Balanced Iterative Reducing and Clustering using Hierarchies, by Zhang, Ramakrishnan, Livny (SIGMOD’96)
Incrementally constructs a CF-tree (Clustering Feature tree), a hierarchical data structure for multiphase clustering:
   Phase 1: scan the DB to build an initial in-memory CF-tree (a hierarchical compression of the data that tries to preserve its inherent clustering structure)
   Phase 2: use an arbitrary (hierarchical) clustering algorithm to cluster the data in each leaf node of the CF-tree
Scales linearly: finds a good clustering with a single scan and improves the quality with a few additional scans (a good feature for big data analysis).

Weakness: handles only numeric data, and is sensitive to the input order of the data records.
BIRCH's Clustering Feature Vector

Clustering feature of a sub-cluster: CF = (N, LS, SS)

   N  : number of data vectors in the sub-cluster
   LS : Σ_{i=1}^{N} X_i   (linear sum)
   SS : Σ_{i=1}^{N} X_i^2 (square sum)

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8):

   CF = (5, (16,30), (54,190))
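Reproducing this CF vector, plus the additivity property that lets BIRCH merge sub-clusters in constant time; a sketch:

```python
import numpy as np

points = np.array([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)], dtype=float)

N = len(points)                   # zeroth moment
LS = points.sum(axis=0)           # first moment: linear sum per dimension
SS = (points ** 2).sum(axis=0)    # second moment: square sum per dimension
print(N, LS, SS)                  # 5 [16. 30.] [ 54. 190.]

# CF vectors are additive, so merging two sub-clusters is O(1):
CF1 = (3, points[:3].sum(axis=0), (points[:3] ** 2).sum(axis=0))
CF2 = (2, points[3:].sum(axis=0), (points[3:] ** 2).sum(axis=0))
merged = (CF1[0] + CF2[0], CF1[1] + CF2[1], CF1[2] + CF2[2])  # equals (N, LS, SS)
```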
BIRCH’s Parameters & Their Meanings
A clustering feature holds the summary statistics for the data in a sub-cluster:
   The zeroth moment: the number of data vectors in the sub-cluster
   The first moment: the linear sum of the N data vectors
   The second moment: the square sum of the N data vectors

BIRCH uses two parameters to control node splits while constructing a CF-tree:
   Branching factor: constrains the maximum number of child nodes per non-leaf node (a leaf node is required to fit in a memory page of size P, which determines the branching factor of a leaf node)
   Threshold: constrains the maximum average distance between data pairs of a sub-cluster in the leaf nodes
CF Tree (branching factor = 7, threshold = 6)

[Figure: a CF-tree whose root holds entries CF1 ... CF6, each pointing to a child; non-leaf nodes hold entries CF1 ... CF5 pointing to their children; leaf nodes hold CF entries and are chained together by prev/next pointers]
An object is inserted into the closest leaf entry; the node is split as needed. After an insertion, information about it is propagated toward the root of the tree. The construction is similar to that of a B+-tree.
CURE (Clustering Using REpresentatives )
CURE: proposed by Guha, Rastogi & Shim, 1998; a type of hierarchical clustering.

Most clustering algorithms either favor clusters with spherical shapes and similar sizes, or are fragile in the presence of outliers. CURE overcomes these problems.

Uses multiple representative points to represent a cluster. At each step of clustering, the two clusters with the closest pair of representative points are merged.

Adjusts well to arbitrarily shaped clusters: the representative points of a cluster attempt to capture the data distribution and shape of the cluster.

Stops building the cluster hierarchy when only k clusters remain.
The Steps for a Large Data Set in CURE

[Figure: overview of the steps involved in clustering a large data set using CURE]
Cure: The Algorithm(for large data set in CURE )
1. Phase 1: pre-clustering

   a. Draw a random sample S of the original data
      (if the original data set is very large)

   b. Partition sample S into p partitions
      (partitioning for speedup)

   c. Partially cluster each partition into |S|/(p × q) clusters, for some
      q > 1 (by CURE hierarchical clustering, starting with each input
      point as a separate cluster)

   d. Outlier elimination:
      Most outliers are eliminated by the random sampling (step a)
      If a cluster grows too slowly, eliminate it
Cure: The Algorithm
2. Phase 2: find the final clusters (by CURE)

   a. Cluster the partial clusters (# of partial clusters <= |S|/q):
      Find well-scattered representative points for the new cluster at each
      iteration. The representative points of a cluster are used to compute
      its distance from other clusters.
      The distance between two clusters is the distance between their
      closest pair of representative points. Merge the closest pair of
      clusters.
      The representative points falling in each newly formed cluster are
      "shrunk", i.e., moved toward the cluster center by a user-specified
      fraction α.
      The representative points capture the shape of the cluster.

   b. Mark the data with the corresponding cluster labels
Data Partitioning and Clustering

[Figure: a sample of s = 50 points is split into p = 2 partitions of size s/p = 25 each, then each partition is partially clustered into s/(p q) = 5 clusters]
Representative Points Shrinking in Cure
Shrink the multiple representative points towards the gravity center of the cluster by a fraction α.

Multiple representatives capture the shape of the cluster.

[Figure: representative points before and after shrinking toward the cluster center; the old and new cluster centers are marked]
CHAMELEON
CHAMELEON: a type of hierarchical clustering, by G. Karypis, E. H. Han and V. Kumar'99.

Measures the similarity based on a dynamic model: two clusters are merged only if the interconnectivity and closeness (proximity) between the two clusters are high relative to the internal interconnectivity of the clusters and the closeness of items within the clusters.

A two-phase algorithm:
   1. Use a graph-partitioning algorithm to cluster the objects into a large number of relatively small sub-clusters.
   2. Use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining these sub-clusters.
Overall Framework of CHAMELEON
Data set → construct a sparse graph (a k-nearest-neighbor graph) → partition the graph by cutting long edges → merge the partitions according to relative interconnectivity and relative closeness → final clusters
Chapter 7. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
Density-Based Clustering Methods
Clustering based on density (a local cluster criterion, such as density-connected points).

Major features:
   Discover clusters of arbitrary shape
   Handle noise
   One scan
   Need density parameters as a termination condition

Several interesting studies:
   DBSCAN: Ester, et al. (KDD'96)
   OPTICS: Ankerst, et al. (SIGMOD'99)
   DENCLUE: Hinneburg & Keim (KDD'98)
Density-Based Clustering: Background (I)
Two density parameters:
   ε : maximum neighborhood radius of a core point
   MinPts : minimum number of points in an ε-neighborhood of a core point

N_ε(p) = {q ∈ D | dist(p,q) <= ε}
   (D : the data space, N_ε(p) : the ε-neighborhood of point p)

Directly density-reachable: a point p is directly density-reachable from a point q wrt. ε, MinPts if
   1) p belongs to N_ε(q)
   2) |N_ε(q)| >= MinPts (q is a core point)

[Figure: q is a core point with p inside its ε-neighborhood; MinPts = 5, ε = 1 cm]
Density-Based Clustering: Background (II)
Density-reachable: a point p is density-reachable from a point q wrt. ε, MinPts if there is a chain of points p_1, ..., p_n with p_1 = q and p_n = p such that p_{i+1} is directly density-reachable from p_i.

Density-connected: a point p is density-connected to a point q wrt. ε, MinPts if there is a point o such that both p and q are density-reachable from o wrt. ε and MinPts.

[Figure: a chain q → p_1 → p of direct reachability, and points p, q both density-reachable from o]
DBSCAN: Density Based Spatial Clustering of Applications with Noise
Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points.

If the ε-neighborhood of an object contains at least MinPts objects, the object is called a core object.

[Figure: core points, border points, and outliers for ε = 1 cm, MinPts = 5]
DBSCAN: The Algorithm
Arbitrarily select a point p.

Retrieve all points density-reachable from p wrt. ε and MinPts.

If p is a core point, a cluster is formed.

If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database.

Continue the process until all of the points have been processed.
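This algorithm is implemented in scikit-learn; a sketch on synthetic data (eps corresponds to ε and min_samples to MinPts):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense blobs plus one isolated point (synthetic data)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (30, 2)),
               rng.normal(5, 0.3, (30, 2)),
               [[10.0, 10.0]]])

# eps plays the role of the radius epsilon, min_samples of MinPts
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
print(set(labels))   # {0, 1, -1}: two clusters; -1 marks noise/outliers
```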
Chapter 7. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
Grid-Based Clustering Method
Uses a multi-resolution grid data structure.

Several interesting methods:
   STING (a STatistical INformation Grid approach) by Wang, Yang and Muntz (1997)
   WaveCluster by Sheikholeslami, Chatterjee, and Zhang (VLDB'98): a multi-resolution clustering approach using the wavelet method
   CLIQUE: Agrawal, et al. (SIGMOD'98)
STING: A Statistical Information Grid Approach (1)
Wang, Yang and Muntz (VLDB'97)

The spatial area is divided into rectangular cells. There are several levels of cells corresponding to different levels of resolution (1 → 4 → 16 → 64 → ...); each cell at a high level is partitioned into a number of smaller cells at the next lower level.

Statistical information about each cell is calculated and stored beforehand and is used to answer queries. Parameters of higher-level cells can easily be calculated from the parameters of lower-level cells:
   count, mean, s, min, max
   type of distribution (normal, uniform, etc.)

Use a top-down approach to answer spatial data queries:
   Start from a pre-selected layer, typically one with a small number of cells
   For each cell in the current level, compute the confidence interval
STING: A Statistical Information Grid Approach (2)
Remove the irrelevant cells from further consideration. When the current layer is finished, proceed to the next lower level. Repeat this process until the bottom layer is reached.

Advantages:
   Query-independent, easy to parallelize, incremental update
   O(K), where K is the number of grid cells at the lowest level

Disadvantages:
   All cluster boundaries are either horizontal or vertical; no diagonal boundary is detected
STING: A Statistical Information Grid Approach (3)

[Figure: the multi-level hierarchy of grid cells in STING]
CLIQUE (Clustering In QUEst) (clustering for high-dimensional spaces)

Agrawal, Gehrke, Gunopulos, Raghavan (SIGMOD'98)

Automatically identifies the subspaces of a high-dimensional data space that allow better clustering than the original space.

CLIQUE can be considered both density-based and grid-based:
   It partitions each dimension into the same number of equal-length units
   A unit is dense if the fraction of the total data points contained in the unit exceeds the input model parameter τ
   A cluster is a maximal set of connected dense units within a subspace of higher dimensionality
CLIQUE: The Major Steps
Identify dense units of higher dimensionality:
   Partition the data space into units and find the number of points that lie inside each unit.
   Determine the dense units in all 1-D dimensions of interest.
   Identify dense units of higher dimensionality using the Apriori principle (if a k-D unit is dense, then its (k-1)-D projection units must be dense), as in the sketch below.

Generate a minimal description for each cluster:
   Determine the maximal region that covers a set of connected dense units for the cluster.
   Determine the minimal cover for each cluster.
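A sketch of the 1-D dense-unit step (the grid size, τ, and data are all invented; the full algorithm joins dense units dimension by dimension, Apriori-style):

```python
import numpy as np

def dense_units_1d(values, n_units, tau):
    """Partition one dimension into n_units equal-length units and return
    the indices of the dense units (fraction of points > tau)."""
    lo, hi = values.min(), values.max()
    idx = np.minimum(((values - lo) / (hi - lo) * n_units).astype(int),
                     n_units - 1)                     # unit index per point
    counts = np.bincount(idx, minlength=n_units)
    return set(np.nonzero(counts / len(values) > tau)[0])

rng = np.random.default_rng(0)
age = np.concatenate([rng.uniform(30, 50, 80), rng.uniform(20, 70, 20)])
print(dense_units_1d(age, n_units=10, tau=0.05))
# Candidate k-D dense units are then built only from dense (k-1)-D units
# (Apriori pruning) and verified by counting points, as with frequent itemsets.
```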
[Figure: 1-D dense units on the age axis (roughly ages 30-50) and on the salary axis (in units of 10,000) are intersected to find dense 2-D units in the (age, salary) and (age, vacation) subspaces, with density threshold τ = 3; the maximal region covering the connected dense units describes the cluster]
Strength and Weakness of CLIQUE
Strength:
   It automatically finds the subspaces of the highest dimensionality such that high-density clusters exist in those subspaces
   It is insensitive to the order of the input records and does not presume any canonical data distribution
   It scales linearly with the size of the input and has good scalability as the number of dimensions in the data increases

Weakness:
   The accuracy of the clustering result may be degraded at the expense of the simplicity of the method
Chapter 7. Cluster Analysis
What is Cluster Analysis?
Types of Data in Cluster Analysis
A Categorization of Major Clustering Methods
Partitioning Methods
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Clustering Methods
Outlier Analysis
Summary
Model-Based Clustering Methods
Attempt to optimize the fit between the data and some given cluster models.

Conceptual clustering (a statistical and AI approach):
   Clustering is performed first, followed by characterization
   Produces a classification scheme for a set of unlabeled objects
   Finds a conceptual characteristic description for each class/cluster

COBWEB (Fisher'87):
   A popular and simple method of incremental conceptual clustering
   Creates a hierarchical clustering in the form of a classification tree
   Each node denotes a concept and contains a probabilistic description of that concept
COBWEB Clustering Method
A classification tree:

   animal: P(C0) = 1.0, P(scales|C0) = 0.25, ...
   ├─ fish: P(C1) = 0.25, P(scales|C1) = 1.0, ...
   ├─ amphibian: P(C2) = 0.25, P(moist|C2) = 1.0, ...
   └─ mammal/bird: P(C3) = 0.5, P(hair|C3) = 0.5, ...
      ├─ mammal: P(C4) = 0.25, P(hair|C4) = 1.0, ...
      └─ bird: P(C5) = 0.25, P(feathers|C5) = 1.0, ...
Other Model-Based Clustering Methods
Neural network approaches:
   Represent each cluster as an exemplar, acting as a "prototype" of the cluster
   In the clustering process, an input data object is assigned to the cluster whose exemplar is most similar to it according to some distance measure

They are based on competitive learning:
   They involve an organized architecture of neuron units
   The neuron units compete in a "winner-takes-all" fashion for the current input training object
Self-Organizing Feature Maps (SOMs)
(A Neural Network Approach)
Clustering is performed by having the output neurons compete for the current input data object: the neuron whose connection weight vector is closest to the current training object wins, and the winner and its neighbors learn by having their connection weights adjusted.

SOMs are believed to resemble processing that can occur in the human brain.

They are also useful for visualizing high-dimensional data mapped into 2- or 3-D space.
An Example of SOM Architecture
An example SOM network:
   2 input neurons
   16 output neurons (clusters), arranged in a 4 × 4 architecture
   Connection weight matrix: 16 × 2
   Weight vector of neuron i: <w_i1, w_i2>
   Format of input X: <x_1, x_2>

[Figure: 2 input neurons fully connected to a 4 × 4 grid of output neurons]
SOM Output Map Example
Examples of SOM Feature Map
Mathematical Model of an SOM Example (a 3-input, 25-output SOM)

The 25 output neurons are arranged in a 5 × 5 feature map (neurons 1-25). The inputs x_1, x_2, x_3 feed through the connection weights into the competitive layer, producing the feature-map outputs a_1, ..., a_25:

   a_{25×1} = compet( W_{25×3} x_{3×1} )
Weights Update & Neighborhood in SOM
Update the weight vectors in a neighborhood of the winning neuron i*:

   W_i(p+1) = W_i(p) + α (X(p) - W_i(p)),   for i ∈ N_{i*}(d)

   N_{i*}(d) = { j | d(i*, j) <= d }

where X(p) is the training vector at iteration p.

Example (5 × 5 map, neuron 13 the winner):

   N_13(1) = {8, 12, 13, 14, 18}
   N_13(2) = {3, 7, 8, 9, 11, 12, 13, 14, 15, 17, 18, 19, 23}
SOM Learning Algorithm

Step 1: Initialisation.

Set the initial connection weights to small random values, say in an interval [0, 1], and assign a small positive value to the learning rate parameter α.
Step 2: Find the winning neuron (iteration p) (activation and similarity matching).

Activate the Kohonen network by applying an input vector X = (x_1, x_2, ..., x_n), and find the winning (best-matching) neuron i* using the criterion of minimal Euclidean distance:

   i*(p) = min_i || X - W_i(p) ||
         = min_i [ Σ_{j=1}^{n} (x_j - w_ij(p))^2 ]^(1/2),   i = 1, ..., m

where n is the number of input neurons and m is the number of neurons in the Kohonen layer.
Step 3: Learning.

Update the connection weight between input neuron j and output neuron i (1 <= i <= m, 1 <= j <= n):

   w_ij(p+1) = w_ij(p) + Δw_ij(p)

where Δw_ij(p) is the weight correction at iteration p, determined by the competitive learning rule:

   Δw_ij(p) = α (x_j - w_ij(p))   if i ∈ N_{i*}(d, p)
   Δw_ij(p) = 0                   otherwise

Here α is the learning rate parameter, and N_{i*}(d, p) is the neighbourhood function centred on the winning neuron i* at a distance d at iteration p.
Step 4: Iteration.

Increase iteration p by one, go back to Step 2, and continue until the criterion of minimal Euclidean distance is satisfied or no noticeable changes occur in the feature map.
Learning in the Kohonen Network

The overall effect of the SOM learning rule is to move the connection weight vector W_i of the trained output neuron i towards the input pattern X:

   ΔW_i = α (X - W_i)

   W_i(p+1) = W_i(p) + ΔW_i(p) = W_i(p) + α (X(p) - W_i(p))

[Figure: the weight vector W_i moves a fraction α along the difference X - W_i towards the input pattern]
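Putting steps 1-4 together, a compact SOM training loop for the 5 × 5 map above, with a Manhattan-distance neighborhood matching N_13(1) and N_13(2); the training data and learning rate are invented:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 25, 3                            # 5 x 5 output map, 3 inputs
W = rng.uniform(0, 1, (m, n))           # step 1: small random weights
alpha = 0.1                             # learning rate (made-up value)
grid = np.array([(i // 5, i % 5) for i in range(m)])   # map coordinates

def neighborhood(winner, d):
    """N_i*(d): neurons within Manhattan distance d of the winner on the
    map (matches N_13(1) and N_13(2) above)."""
    return np.abs(grid - grid[winner]).sum(axis=1) <= d

for p in range(1000):                   # step 4: iterate
    x = rng.uniform(0, 1, n)            # a training vector (synthetic)
    winner = np.argmin(np.linalg.norm(x - W, axis=1))    # step 2
    mask = neighborhood(winner, d=1)
    W[mask] += alpha * (x - W[mask])    # step 3: W_i += alpha (X - W_i)
```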
Convergence Process of SOM
[Figure: four snapshots of the SOM feature map unfolding over the input space [-1, 1] × [-1, 1], after 100, 1000, 2000, and 5000 iterations]
What Is Outlier Discovery?
What are outliers?
   A set of objects considerably dissimilar from the remainder of the data
   Example: in sports, Michael Jordan, ...

Problem: find the top n outlier points

Applications:
   Credit card fraud detection
   Telecom fraud detection
   Customer segmentation
   Medical analysis
Outlier Discovery: Statistical Approaches
Assume a model of the underlying distribution that generates the data set (e.g., a normal distribution).

Use discordancy tests, depending on:
   the data distribution
   the distribution parameters (e.g., mean, variance)
   the number of expected outliers

Drawbacks:
   Most tests are for a single attribute
   In many cases, the data distribution may not be known
Outlier Discovery: Distance-Based Approach
Introduced to counter the main limitations imposed by statistical methods: we need multi-dimensional analysis without knowing the data distribution.

Distance-based outlier: a DB(p, D)-outlier is an object O in a dataset T such that at least a fraction p of the objects in T lie at a distance greater than D from O.

Algorithms for mining distance-based outliers:
   Index-based algorithm
   Nested-loop algorithm (see the sketch below)
   Cell-based algorithm
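A sketch of the nested-loop variant (synthetic data; for simplicity the fraction is computed over the other objects in T):

```python
import numpy as np

def db_outliers(X, p, D):
    """Nested-loop sketch of DB(p, D)-outliers: O is reported if at least a
    fraction p of the other objects lie at a distance greater than D from O."""
    outliers = []
    for i, o in enumerate(X):
        far = sum(np.linalg.norm(o - x) > D
                  for j, x in enumerate(X) if j != i)
        if far / (len(X) - 1) >= p:
            outliers.append(i)
    return outliers

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), [[12.0, 12.0]]])
print(db_outliers(X, p=0.95, D=5.0))   # [50]: the isolated point
```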
Outlier Discovery: Deviation-Based Approach
Identifies outliers by examining the main characteristics of the objects in a group: objects that "deviate" from this description are considered outliers.

Sequential exception technique:
   simulates the way in which humans distinguish unusual objects from a series of similar objects

OLAP data cube technique:
   uses data cubes to identify regions of anomalies in a large multi-dimensional data set
Summary
Cluster analysis groups data objects based on their similarity and has wide applications
Measure of similarity can be computed for various types of data
Clustering algorithms can be categorized into partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods
Outlier detection and analysis are very useful for fraud detection and can be performed by statistical, distance-based or deviation-based approaches
There are still many open research issues in cluster analysis, such as constraint-based clustering