Intelligent Database Systems Lab
National Yunlin University of Science and Technology (國立雲林科技大學)
A genetic clustering algorithm for data with non-spherical-shape clusters
Advisor: Dr. Hsu
Graduate: Sheng-Hsuan Wang
Authors: Lin Yu Tseng, Shiueng Bien Yang
Department of Information Management
Pattern Recognition 33 (2000) 1251-1259
Outline
Motivation
Objective
Introduction
The basic concept of genetic strategy
The genetic clustering algorithm
Experiments
Concluding remarks and Summary
Personal opinions
Review
Motivation
Traditional clustering leaves several problems open:
How many clusters are there?
How should the threshold distance d be chosen in neighborhood clustering?
How can non-spherical-shape clusters be handled?
Objective
To solve these problems of traditional clustering algorithms.
A genetic clustering algorithm that:
handles non-spherical-shape clusters;
groups objects according to their similarities and automatically finds the proper number of clusters k.
Introduction
Clustering methods can broadly be classified into two categories:
Hierarchical: agglomerative, divisive
Non-hierarchical: k-means
Introduction
Problems in most of these clustering algorithms:
The number of clusters?
Non-spherical-shape clusters?
The distance threshold for merging?
GA clustering algorithm: clustering can be viewed as a search problem, so a genetic algorithm applies.
Basic concept of Classical Genetic Algorithm
Flowchart:
1. Encoding schemes
2. Fitness evaluation
3. Testing the end of the algorithm: YES → Halt; NO → continue
4. Parent selection
5. Crossover operators
6. Mutation operators, then back to fitness evaluation (step 2)
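The flowchart can be sketched as a short loop. This is only an illustrative skeleton of a classical GA, not the authors' implementation: the function name `run_ga`, the fixed generation budget used as the termination test, the per-bit mutation, and the toy fitness (number of 1 bits) are all assumptions made for the example.

```python
import random

def run_ga(fitness, length, pop_size=20, generations=40, pc=0.8, pm=0.05, seed=0):
    """Classical GA over bit strings; fitness must return a non-negative number."""
    rng = random.Random(seed)
    # Encoding: every individual is a fixed-length list of bits.
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    best = max(population, key=fitness)
    # Termination test (assumed): stop after a fixed number of generations.
    for _ in range(generations):
        # Fitness evaluation; the tiny offset keeps all selection weights positive.
        weights = [fitness(ind) + 1e-9 for ind in population]
        next_pop = []
        while len(next_pop) < pop_size:
            # Parent selection: fitness-proportional (roulette wheel).
            a, b = rng.choices(population, weights=weights, k=2)
            a, b = a[:], b[:]
            # Crossover: single cut point, applied with probability pc.
            if rng.random() < pc:
                cut = rng.randint(1, length - 1)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            # Mutation: flip each bit independently with probability pm.
            for child in (a, b):
                for k in range(length):
                    if rng.random() < pm:
                        child[k] ^= 1
                next_pop.append(child)
        population = next_pop[:pop_size]
        best = max(population + [best], key=fitness)
    return best

# Toy usage: maximise the number of 1 bits in a 12-bit string.
best = run_ga(fitness=sum, length=12)
```

The best string seen so far is carried along separately, so the returned result never gets worse from one generation to the next.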
The genetic clustering algorithm
The algorithm CLUSTERING consists of two stages:
First stage (nearest neighbor): the n objects O1, O2, …, On are grouped into small clusters C1, C2, …, Cm.
Second stage (GA clustering): the clusters C1, C2, …, Cm are merged.
First Stage
Step 1: find the nearest neighbor of each object Oi.
Step 2: compute dav, the average of the nearest-neighbor distances.
What is the meaning of u?
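Steps 1 and 2 can be sketched as follows. The sketch assumes the objects are points compared by Euclidean distance; the function names and the sample points are hypothetical.

```python
import math

def nearest_neighbor_distances(points):
    """For each point, the Euclidean distance to its nearest other point."""
    return [
        min(math.dist(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]

def average_nn_distance(points):
    """d_av: the mean of the nearest-neighbor distances."""
    dists = nearest_neighbor_distances(points)
    return sum(dists) / len(dists)

# Hypothetical sample data, chosen so the distances are easy to check by hand.
points = [(0.0, 0.0), (1.0, 0.0), (1.0, 2.0), (5.0, 2.0)]
nn = nearest_neighbor_distances(points)  # [1.0, 1.0, 2.0, 4.0]
d_av = average_nn_distance(points)       # (1 + 1 + 2 + 4) / 4 = 2.0
```

The brute-force search is quadratic in the number of objects, which is acceptable for a sketch; a spatial index would be the usual choice for larger data sets.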
First Stage
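Step 3: compute the adjacency matrix A (n × n):
$$A(i,j) = \begin{cases} 1, & \text{if } \|O_i - O_j\| \le d \\ 0, & \text{otherwise} \end{cases} \qquad \text{where } 1 \le i, j \le n$$
Step 4: find the connected components of the graph defined by A; denote them by C1, C2, …, Cm.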
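Steps 3 and 4 can be sketched as follows. The threshold d is taken as given here, since the slides do not show how it is derived from dav; the Euclidean distance, the function names, and the sample points are assumptions made for the example.

```python
import math

def adjacency_matrix(points, d):
    """A[i][j] = 1 if points i and j are within distance d of each other, else 0."""
    n = len(points)
    return [
        [1 if math.dist(points[i], points[j]) <= d else 0 for j in range(n)]
        for i in range(n)
    ]

def connected_components(adj):
    """Group vertex indices into connected components with a depth-first search."""
    n = len(adj)
    seen = [False] * n
    components = []
    for start in range(n):
        if seen[start]:
            continue
        seen[start] = True
        stack, comp = [start], []
        while stack:
            i = stack.pop()
            comp.append(i)
            for j in range(n):
                if adj[i][j] and not seen[j]:
                    seen[j] = True
                    stack.append(j)
        components.append(sorted(comp))
    return components

# Hypothetical sample: two well-separated chains of points on a line.
points = [(0, 0), (1, 0), (2, 0), (10, 0), (11, 0)]
clusters = connected_components(adjacency_matrix(points, d=1.5))
# clusters == [[0, 1, 2], [3, 4]]
```

Points 0 and 2 are farther apart than d but still end up in the same component through point 1, which is what lets this stage follow elongated, non-spherical shapes.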
Second Stage
The initialization step: population, coding, Dinter and Dintra
The three phases of GA: reproduction phase, crossover phase, mutation phase
Second Stage
Compute the distance matrix D (m × m), which holds the distance between each pair of clusters Ci and Cj.
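A minimal sketch of building such a matrix is given below. The slides do not say how the distance between two clusters is measured; the sketch assumes the smallest point-to-point Euclidean distance between them, and the function name and sample clusters are hypothetical.

```python
import math

def cluster_distance_matrix(clusters):
    """Symmetric m x m matrix of distances between clusters.

    Assumption: the distance between two clusters is the smallest
    Euclidean distance between any point of one and any point of the other.
    """
    m = len(clusters)
    D = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1, m):
            d = min(math.dist(p, q) for p in clusters[i] for q in clusters[j])
            D[i][j] = D[j][i] = d
    return D

# Hypothetical sample: three small clusters given as lists of points.
clusters = [[(0, 0), (1, 0)], [(3, 0)], [(3, 4)]]
D = cluster_distance_matrix(clusters)
# D[0][1] == 2.0 and D[1][2] == 4.0
```

Only the upper triangle is computed and then mirrored, because the matrix is symmetric with a zero diagonal.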
Second Stage
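The initialization step
Population: 50 strings. The length of each string is m, one bit for each cluster in {C1, C2, …, Cm}.
For each string Ri, two sets Ui and U'i are defined: Ui holds the clusters whose bit is 1, and U'i holds the clusters whose bit is 0.
Example (m = 5):
R1 = 1 1 1 0 0: U1 = {C1, C2, C3}; U'1 = {C4, C5}
R2 = 1 0 1 1 0: U2 = {C1, C3, C4}; U'2 = {C2, C5}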
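The coding can be sketched as follows: a population of 50 random bit strings of length m, and a decoder that splits the clusters into the 1-bit set and the 0-bit set, as in the two example strings. The function name `decode` and the cluster labels as Python strings are assumptions made for the example.

```python
import random

def decode(bits):
    """Split clusters C1..Cm into those whose bit is 1 and those whose bit is 0."""
    u = [f"C{k}" for k, b in enumerate(bits, start=1) if b == 1]
    u_prime = [f"C{k}" for k, b in enumerate(bits, start=1) if b == 0]
    return u, u_prime

# The two strings from the example above.
u1, u1_prime = decode([1, 1, 1, 0, 0])  # ['C1', 'C2', 'C3'], ['C4', 'C5']
u2, u2_prime = decode([1, 0, 1, 1, 0])  # ['C1', 'C3', 'C4'], ['C2', 'C5']

# Initial population: 50 random strings of length m (here m = 5).
m = 5
rng = random.Random(0)
population = [[rng.randint(0, 1) for _ in range(m)] for _ in range(50)]
```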
Second Stage
Intra-distance Dintra and the inter-distance Dinter
U1={C1, C2, C3} ; U’1={C4, C5, C7}
Second Stage
Reproduction phase
Fitness function: $SCORE(R_i) = D_{inter}(R_i) \cdot w - D_{intra}(R_i)$, with w in [1, 3].
Reproduction probability of $R_i$: $SCORE(R_i) / \sum_{j=1}^{N} SCORE(R_j)$
Crossover phase: pc = 0.8.
Mutation phase: pm = 0.1.
Example strings:
R1 1 1 1 0 0
R2 1 0 1 1 0
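The fitness function and the reproduction probability can be sketched as follows. The Dinter and Dintra values are passed in as plain numbers because their definitions are not given in the text; the sample values are made up, and the sketch assumes all scores are positive so that the ratios are valid probabilities.

```python
def score(d_inter, d_intra, w):
    """SCORE(Ri) = Dinter(Ri) * w - Dintra(Ri)."""
    return d_inter * w - d_intra

def reproduction_probabilities(scores):
    """Each string's share of the total score (assumes every score is positive)."""
    total = sum(scores)
    return [s / total for s in scores]

# Hypothetical (Dinter, Dintra) pairs for three strings, with w = 2.
pairs = [(4.0, 2.0), (3.0, 2.0), (2.0, 2.0)]
w = 2.0
scores = [score(d_inter, d_intra, w) for d_inter, d_intra in pairs]  # [6.0, 4.0, 2.0]
probs = reproduction_probabilities(scores)                           # [1/2, 1/3, 1/6]
```

A larger w rewards separation from the remaining clusters more strongly relative to compactness, which is why the choice of w influences how many clusters result.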
Merge_Sets_Finding Algorithm
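Step 1: Sort the strings by fitness so that $SCORE(R_1) \ge SCORE(R_2) \ge \dots \ge SCORE(R_N)$. Set $U = \emptyset$ and $i = 1$.
Step 2: Choose Ri and set $U = U \cup U_i$.
Step 3: Choose the smallest l > i such that $U \cap U_l = \emptyset$. IF no such l exists THEN go to Step 4 ELSE set i = l and go to Step 2.
Step 4: End.
Example:
R1 = {C1, C2, C3} (merge)
R2 = {C3, C4, C6} (discarded, it shares C3 with R1)
R3 = {C4, C5} (merge)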
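The procedure amounts to a greedy scan over the strings in decreasing order of fitness, keeping a set only when it shares no cluster with the sets already kept. A minimal sketch, using the three example sets; the function name and the score values are hypothetical and serve only to fix the ordering.

```python
def merge_sets_finding(sets_with_scores):
    """Greedily pick pairwise-disjoint sets in decreasing order of score."""
    ordered = sorted(sets_with_scores, key=lambda item: item[1], reverse=True)
    covered = set()  # union U of all sets chosen so far
    chosen = []
    for members, _score in ordered:
        # Keep the set only if it does not intersect U; otherwise discard it.
        if covered.isdisjoint(members):
            chosen.append(members)
            covered |= members
    return chosen

# The three example sets, with made-up scores in decreasing order.
candidates = [
    ({"C1", "C2", "C3"}, 9.0),
    ({"C3", "C4", "C6"}, 7.0),
    ({"C4", "C5"}, 5.0),
]
merged = merge_sets_finding(candidates)
# merged == [{"C1", "C2", "C3"}, {"C4", "C5"}]
```

Each chosen set is then merged into a single cluster, while clusters that appear in no chosen set (C6 in this example) stay as they are.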
Experiments - 1
Noise: distance > 2 dav
[Figure: original data set]
Experiments - 1
[Figure panels: u = 1.2, 8 clusters; 7 clusters]
Experiments - 1
[Figure panels: 6 clusters; u = 1.5 or 2, 5 clusters]
Experiments - 1
[Figure panels: u = 1.2, w = 2, 4 clusters (best); 3 clusters]
Experiments - 1
[Figure panels: 2 clusters; 4 clusters (direct GA)]
Experiments - 1
[Figure: 4 clusters (k-means)]
Experiments - 2
[Figure panels: original data set; 4 clusters; 3 clusters; 2 clusters]
Experiments - 3
[Figure panels: original data set; 4 clusters]
Concluding remarks and Summary
The genetic clustering algorithm CLUSTERING:
handles non-spherical-shape clusters;
clusters automatically;
uses binary search to find the proper interval for w.
Personal Opinions
The proper number of clusters is decided by the value of w.
Review
Using the genetic clustering algorithm (GCA) for automatic clustering.
Split: nearest neighbor (NN).
Merge: Merge_Sets_Finding algorithm.