Objectives
After finishing this class, students will:
Get an overview of clustering problems
Know and understand algorithms for solving
clustering problems
Clustering Problem:
Given a database D = {t1, t2, …, tn} of tuples
and an integer value k, the Clustering
Problem is to define a mapping f: D → {1, …, k}
where each ti is assigned to one cluster
Kj, 1 ≤ j ≤ k.
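As a minimal sketch of this definition, the mapping f can be written as a dictionary from tuples to cluster indices. The toy database, the value of k, and the helper names is_valid_clustering and clusters_from_mapping are illustrative assumptions, not part of the original slides.

```python
# Hypothetical toy database D of tuples and a cluster count k (illustrative values).
D = [(1.0, 1.0), (1.2, 0.8), (8.0, 8.0), (8.1, 7.9)]
k = 2

# A clustering is a mapping f: D -> {1, ..., k}; here it is written out by hand.
f = {D[0]: 1, D[1]: 1, D[2]: 2, D[3]: 2}


def is_valid_clustering(f, D, k):
    """Check that every tuple in D is assigned one cluster index in 1..k."""
    return all(t in f and 1 <= f[t] <= k for t in D)


def clusters_from_mapping(f, k):
    """Group the tuples into clusters K_1..K_k according to the mapping f."""
    K = {j: [] for j in range(1, k + 1)}
    for t, j in f.items():
        K[j].append(t)
    return K
```

A clustering algorithm's job is to produce f automatically; the sketch only shows what a finished result looks like.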
A clustering is a set of clusters
No prior knowledge of the number of clusters
No prior knowledge of the meaning of clusters
Unsupervised learning
Not A Clustering Problem:
Supervised classification
Have class label information
Simple segmentation
Dividing students into different registration
groups alphabetically, by last name
Results of a query
Groupings are a result of an external
specification
Graph partitioning
Some mutual relevance and synergy, but
areas are not identical
Types of Clustering Methods
Partitioning Methods
The simplest and most fundamental methods in
cluster analysis
Construct various partitions and then
evaluate them by some criterion
Find mutually exclusive clusters of spherical shape
Distance based
Effective for small to medium-sized data sets
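The "evaluate them by some criterion" step can be illustrated with one common choice, the sum of squared errors (SSE) around each cluster centroid. The slides do not name a criterion, so SSE is an assumption here, as are the function names and the sample partitions.

```python
def centroid(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sse(clusters):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    total = 0.0
    for pts in clusters:
        c = centroid(pts)
        total += sum(sum((a - b) ** 2 for a, b in zip(p, c)) for p in pts)
    return total


# Two candidate partitions of the same four points (illustrative data).
good = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
bad = [[(0.0, 0.0), (10.0, 0.0)], [(0.0, 2.0), (10.0, 2.0)]]

# Under this criterion, the partition with the lower SSE is preferred.
best = min([good, bad], key=sse)
```

Because SSE is distance based and measured around a single centroid, it favors compact, roughly spherical clusters.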
Hierarchical Methods
Create a hierarchical decomposition (i.e., multiple
levels) of the set of data (or objects) using some
criterion
The bottom-up approach starts with each object
forming a separate group, then merges the objects or
groups closest to one another until all the groups
are merged into one or a termination condition
holds.
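The bottom-up procedure can be sketched as follows. The slides do not say how "close" two groups are, so this sketch assumes single linkage (the distance between their two nearest members) and uses a target group count, stop_at, as the termination condition; both are illustrative choices.

```python
import math


def single_link(a, b):
    """Distance between two groups: the smallest point-to-point distance (assumed linkage)."""
    return min(math.dist(p, q) for p in a for q in b)


def agglomerate(points, stop_at=1):
    """Merge the two closest groups repeatedly until only stop_at groups remain."""
    groups = [[p] for p in points]  # every object starts as its own group
    merges = []                     # record of merges, in order (the hierarchy)
    while len(groups) > stop_at:
        # Find the pair of groups that are closest to one another.
        i, j = min(
            ((i, j) for i in range(len(groups)) for j in range(i + 1, len(groups))),
            key=lambda ij: single_link(groups[ij[0]], groups[ij[1]]),
        )
        merges.append((groups[i], groups[j]))
        merged = groups[i] + groups[j]
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)] + [merged]
    return groups, merges
```

With stop_at=1 the loop runs until everything is merged into one group, and the merges list then describes every level of the hierarchy.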
Density-Based Methods
Based on connectivity and density functions
Able to find arbitrarily shaped clusters
Clusters are the dense regions of objects in space
that are separated by low-density regions
The cluster density requires each point to have a
minimum number of points within its
“neighborhood”
May filter out outliers
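The neighborhood requirement can be sketched in the style of DBSCAN. The parameter names eps (neighborhood radius) and min_pts (minimum neighbor count) are assumptions borrowed from that algorithm, and the sketch only labels dense points and outliers; it does not build the clusters. Whether a point counts itself as a neighbor varies between descriptions; here it does not.

```python
import math


def neighbors(points, p, eps):
    """All other points within distance eps of p (p itself is not counted)."""
    return [q for q in points if q != p and math.dist(p, q) <= eps]


def split_core_and_outliers(points, eps, min_pts):
    """Return (core, outliers) under the assumed eps / min_pts density rule."""
    # A point is dense ("core") if its neighborhood holds at least min_pts points.
    core = [p for p in points if len(neighbors(points, p, eps)) >= min_pts]
    # A point is treated as an outlier if it is not dense and no dense point is nearby.
    outliers = [
        p for p in points
        if p not in core and not any(math.dist(p, c) <= eps for c in core)
    ]
    return core, outliers
```

Points that end up in outliers lie in low-density regions, which is how a density-based method can filter them out.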
Grid-Based Methods
Quantize the object space into a finite number of
cells that form a grid structure
Fast processing time
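A minimal sketch of the quantization step, assuming square cells of a fixed side length (cell_size is a hypothetical parameter): each point is mapped to the integer index of the cell that contains it, and the points per cell are counted in a single pass.

```python
import math
from collections import Counter


def cell_of(point, cell_size):
    """Integer grid index of the cell containing the point."""
    return tuple(math.floor(c / cell_size) for c in point)


def grid_counts(points, cell_size):
    """Number of points that fall into each occupied grid cell."""
    return Counter(cell_of(p, cell_size) for p in points)
```

Later steps can then work on the occupied cells and their counts instead of on the individual points, and there are typically far fewer cells than points.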
k-Means Algorithm
Assumes Euclidean space
Start by picking k, the number of clusters
Initialize clusters by picking one point per
cluster
Example: pick one point at random, then k - 1
other points, each as far away as possible from
the previously chosen points
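The initialization example can be sketched as follows. "As far away as possible from the previous points" is interpreted here as maximizing the distance to the nearest already chosen point, which is one reasonable reading rather than the only one; the function name and the seed parameter are illustrative.

```python
import math
import random


def pick_initial_points(points, k, seed=0):
    """Pick one point at random, then k - 1 more, each far from those already chosen.

    Assumes k is no larger than the number of distinct points.
    """
    rng = random.Random(seed)
    chosen = [rng.choice(points)]
    while len(chosen) < k:
        # Next pick: the point whose nearest chosen point is farthest away.
        nxt = max(
            (p for p in points if p not in chosen),
            key=lambda p: min(math.dist(p, c) for c in chosen),
        )
        chosen.append(nxt)
    return chosen
```

Spreading the starting points out this way makes it less likely that two of them fall inside the same natural cluster.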
k-Means Algorithm: Populating
Clusters
For each point, place it in the cluster whose
current centroid is nearest to it
After all points are assigned, fix the centroids
of the k clusters
Optional: reassign all points to their closest
centroid
This sometimes moves points between clusters
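The assignment and reassignment steps can be sketched with the common batch variant of k-means, which recomputes the centroids after each full pass and repeats the reassignment until no point changes cluster. This is an assumption about how the steps are combined: the slides describe a single pass plus one optional reassignment. The function names and max_rounds are illustrative.

```python
import math


def nearest(point, centroids):
    """Index of the centroid closest to the point (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def mean(points):
    """Component-wise mean of a non-empty list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, centroids, max_rounds=100):
    """Alternate between assigning points and recomputing centroids until stable."""
    centroids = list(centroids)
    assignment = None
    for _ in range(max_rounds):
        # Place each point in the cluster whose current centroid is nearest.
        new_assignment = [nearest(p, centroids) for p in points]
        if new_assignment == assignment:
            break  # no point moved between clusters
        assignment = new_assignment
        # Recompute each centroid from its current members (empty clusters keep theirs).
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                centroids[j] = mean(members)
    return assignment, centroids
```

The returned assignment lists, for each input point in order, the index of the cluster it ended up in.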