CS685 : Special Topics in Data Mining, UKY
The UNIVERSITY of KENTUCKY
Semi-Supervised Clustering
CS 685: Special Topics in Data Mining, Spring 2008
Jinze Liu
Outline of lecture
• Overview of clustering and classification
• What is semi-supervised learning?
  – Semi-supervised clustering
  – Semi-supervised classification
• Semi-supervised clustering
  – What is semi-supervised clustering?
  – Why semi-supervised clustering?
  – Semi-supervised clustering algorithms
Supervised classification versus unsupervised clustering
• Unsupervised clustering
  – Group similar objects together to find clusters
    • Minimize intra-class distance
    • Maximize inter-class distance
• Supervised classification
  – Class label for each training sample is given
  – Build a model from the training data
  – Predict class labels for unseen future data points
What is clustering?
• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups
[Figure: intra-cluster distances are minimized; inter-cluster distances are maximized]
What is Classification?
[Figure: classification workflow, in which a learning algorithm induces a model from the training set (induction); the model is then applied to the test set to predict labels (deduction)]

Training Set:
Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Test Set:
Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
12   Yes      Medium   80K      ?
13   Yes      Large    110K     ?
14   No       Small    95K      ?
15   No       Large    67K      ?
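As a concrete (hypothetical) instance of the induction/deduction loop above, the sketch below "learns" from the toy training table with a 1-nearest-neighbor rule. The 1-NN choice and the mixed-attribute distance are illustrative assumptions, not the lecture's method.

```python
# Toy training set from the slide: (Attrib1, Attrib2, Attrib3 in $K, Class).
train = [
    ("Yes", "Large", 125, "No"),  ("No", "Medium", 100, "No"),
    ("No", "Small", 70, "No"),    ("Yes", "Medium", 120, "No"),
    ("No", "Large", 95, "Yes"),   ("No", "Medium", 60, "No"),
    ("Yes", "Large", 220, "No"),  ("No", "Small", 85, "Yes"),
    ("No", "Medium", 75, "No"),   ("No", "Small", 90, "Yes"),
]

def dist(a, b):
    # Illustrative mixed distance: mismatch count on the two categorical
    # attributes plus a scaled difference on the numeric one.
    return (a[0] != b[0]) + (a[1] != b[1]) + abs(a[2] - b[2]) / 100.0

def predict(x):
    # Deduction step: return the class of the nearest training record.
    return min(train, key=lambda r: dist(r, x))[3]
```

Here the "model" is simply the stored training set, so induction is trivial; for test record 11 (`("No", "Small", 55)`), the nearest training record is Tid 3 and the prediction is "No".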
Clustering algorithms
• K-Means
• Hierarchical clustering
• Graph-based clustering (spectral clustering)
Classification algorithms
• Decision Trees
• Naïve Bayes classifier
• Support Vector Machines (SVM)
• K-Nearest-Neighbor classifiers
• Logistic Regression
• Neural Networks
• Linear Discriminant Analysis (LDA)
Supervised Classification Example
[Figure: scatter-plot illustration]
Unsupervised Clustering Example
[Figure: scatter-plot illustration]
Semi-Supervised Learning
• Combines labeled and unlabeled data during training to improve performance:
  – Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
  – Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.
[Figure: a spectrum from unsupervised clustering, through semi-supervised learning, to supervised classification]
Semi-Supervised Classification Example
[Figure: scatter-plot illustration]
Semi-Supervised Classification
• Algorithms:
  – Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00]
  – Co-training [Blum:COLT98]
  – Transductive SVMs [Vapnik:98, Joachims:ICML99]
  – Graph-based algorithms
• Assumptions:
  – Known, fixed set of categories given in the labeled data.
  – The goal is to improve classification of examples into these known categories.
• More discussion next week
Semi-Supervised Clustering Example
[Figure: scatter-plot illustration]
Second Semi-Supervised Clustering Example
[Figure: scatter-plot illustration]
Semi-supervised clustering: problem definition
• Input:
  – A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical)
  – A small amount of domain knowledge
• Output:
  – A partitioning of the objects into k clusters (possibly with some discarded as outliers)
• Objective:
  – Maximum intra-cluster similarity
  – Minimum inter-cluster similarity
  – High consistency between the partitioning and the domain knowledge
Why semi-supervised clustering?
• Why not clustering?
  – The clusters produced may not be the ones required.
  – Sometimes there are multiple possible groupings.
• Why not classification?
  – Sometimes there is insufficient labeled data.
• Potential applications
  – Bioinformatics (gene and protein clustering)
  – Document hierarchy construction
  – News/email categorization
  – Image categorization
Semi-Supervised Clustering
• Domain knowledge
  – Partial label information is given
  – Apply some constraints (must-links and cannot-links)
• Approaches
  – Search-based semi-supervised clustering (this class)
    • Alter the clustering algorithm using the constraints
  – Similarity-based semi-supervised clustering (next class)
    • Alter the similarity measure based on the constraints
  – Combination of both
Search-Based Semi-Supervised Clustering
• Alter the clustering algorithm that searches for a good partitioning by:
  – Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99]
  – Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01]
  – Using the labeled data to initialize clusters in an iterative refinement algorithm such as K-Means [Basu:ICML02]
Overview of K-Means Clustering
• K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters \{X_l\}_{l=1}^{K} with cluster centers \{\mu_l\}_{l=1}^{K}.
• Algorithm: initialize the K cluster centers randomly, then repeat until convergence:
  – Cluster assignment step: assign each data point x to the cluster X_l whose center \mu_l has minimum L2 distance from x.
  – Center re-estimation step: re-estimate each cluster center \mu_l as the mean of the points in that cluster.
K-Means Objective Function
• Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers:

  J = \sum_{l=1}^{K} \sum_{x_i \in X_l} \lVert x_i - \mu_l \rVert^2

• Initialization of K cluster centers:
  – Totally random
  – Random perturbation from the global mean
  – Heuristic to ensure well-separated centers
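The assignment/re-estimation loop and the SSE objective described above can be sketched as a minimal NumPy implementation; the random initialization and the convergence test here are simplifications, not the lecture's exact procedure.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Plain K-Means sketch: random initialization, then alternate the
    # cluster-assignment and center re-estimation steps until convergence.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center (L2).
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Re-estimation step: each center becomes the mean of its points.
        new_centers = np.array([X[labels == l].mean(axis=0) if np.any(labels == l)
                                else centers[l] for l in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Objective J: sum over clusters of squared distances to the center.
    sse = sum(np.sum((X[labels == l] - centers[l]) ** 2) for l in range(k))
    return labels, centers, sse
```

On two well-separated pairs of points with k = 2, this converges to one cluster per pair regardless of which points are drawn as initial centers.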
K-Means Example
[Figure: animation of K-Means on a 2-D point set, with the cluster means marked ×]
• Randomly initialize means
• Assign points to clusters
• Re-estimate means
• Re-assign points to clusters
• Re-estimate means
• Re-assign points to clusters
• Re-estimate means and converge
Semi-Supervised K-Means
• Partial label information is given
  – Seeded K-Means
  – Constrained K-Means
• Constraints (must-link, cannot-link)
  – COP K-Means
Semi-Supervised K-Means for partially labeled data
• Seeded K-Means:
  – Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i.
  – Seed points are used only for initialization, not in subsequent steps.
• Constrained K-Means:
  – Labeled data provided by the user are used to initialize the K-Means algorithm.
  – Cluster labels of seed data are kept unchanged in the cluster assignment steps; only the labels of the non-seed data are re-estimated.
Seeded K-Means
Use labeled data to find the initial centroids and then run K-Means.
The labels for seeded points may change.
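A minimal sketch of Seeded K-Means under the description above; the function and parameter names (`seed_idx`, `seed_labels`) are illustrative.

```python
import numpy as np

def seeded_kmeans(X, seed_idx, seed_labels, k, n_iter=100):
    # Seeded K-Means sketch: the initial center of cluster i is the mean
    # of the seed points labeled i; after initialization the seeds are
    # ordinary points, so their labels may change in later steps.
    seed_idx = np.asarray(seed_idx)
    seed_labels = np.asarray(seed_labels)
    centers = np.array([X[seed_idx][seed_labels == i].mean(axis=0)
                        for i in range(k)])
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)            # seeds treated like any point
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

With one seed per cluster, the seeds simply replace the random initialization and ordinary K-Means runs from there.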
Seeded K-Means Example
[Figure: animation of Seeded K-Means, with the means marked ×]
• Initialize means using labeled data
• Assign points to clusters
• Re-estimate means
• Assign points to clusters and converge (note: the label of a seeded point changes)
Constrained K-Means
Use labeled data to find the initial centroids and then run K-Means.
The labels for seeded points will not change.
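Constrained K-Means differs from the seeded variant only in clamping the seed labels at every assignment step. A sketch, with illustrative names:

```python
import numpy as np

def constrained_kmeans(X, seed_idx, seed_labels, k, n_iter=100):
    # Constrained K-Means sketch: seeded initialization, but the seed
    # points keep their given labels in every assignment step; only
    # non-seed points are re-assigned.
    seed_idx = np.asarray(seed_idx)
    seed_labels = np.asarray(seed_labels)
    centers = np.array([X[seed_idx][seed_labels == i].mean(axis=0)
                        for i in range(k)])
    for _ in range(n_iter):
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        labels[seed_idx] = seed_labels       # clamp the seed labels
        new_centers = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                else centers[i] for i in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

Even a seed that lies nearer to another cluster's center keeps its given label, which is exactly why noisy seeds can hurt this variant.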
Constrained K-Means Example
[Figure: animation of Constrained K-Means, with the means marked ×]
• Initialize means using labeled data
• Assign points to clusters
• Re-estimate means and converge
COP K-Means
• COP K-Means [Wagstaff et al.:ICML01] is K-Means with must-link (must be in the same cluster) and cannot-link (cannot be in the same cluster) constraints on data points.
• Initialization: cluster centers are chosen randomly, but as each one is chosen, any must-link constraints that it participates in are enforced (so that the linked points cannot later be chosen as the center of another cluster).
• Algorithm: during the cluster assignment step in COP-K-Means, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort.
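A sketch of this assignment rule, in a greedy form: constraints are checked against points already assigned in the current pass, and the run aborts when a point has no feasible cluster. Details (initialization, tie-breaking) are simplified relative to [Wagstaff:ICML01].

```python
import numpy as np

def violates(i, c, labels, must, cannot):
    # True if putting point i in cluster c breaks a constraint against a
    # point already assigned in this pass (unassigned points have label -1).
    for a, b in must:
        j = b if a == i else (a if b == i else None)
        if j is not None and labels[j] != -1 and labels[j] != c:
            return True
    for a, b in cannot:
        j = b if a == i else (a if b == i else None)
        if j is not None and labels[j] == c:
            return True
    return False

def cop_kmeans(X, k, must=(), cannot=(), n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = np.full(len(X), -1)
        for i in range(len(X)):
            # Try clusters from nearest to farthest; take the first
            # one that violates no constraint.
            for c in np.argsort(np.linalg.norm(centers - X[i], axis=1)):
                if not violates(i, int(c), labels, must, cannot):
                    labels[i] = c
                    break
            if labels[i] == -1:
                return None                  # no feasible cluster: abort
        new_centers = np.array([X[labels == c].mean(axis=0) if np.any(labels == c)
                                else centers[c] for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers
```

A point carrying both a must-link and a cannot-link to the same already-assigned neighbor has no feasible cluster, reproducing the failure case illustrated below.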
COP K-Means Algorithm
[Figure: COP-KMeans pseudocode]
Illustration
[Figure: must-link constraint; determine the label of the linked point and assign it to the red class]
Illustration
[Figure: cannot-link constraint; determine the label of the point and assign it to the red class]
Illustration
[Figure: a point with both a must-link and a cannot-link constraint; no consistent assignment exists and the clustering algorithm fails]
Summary
• Seeded and Constrained K-Means: partially labeled data
• COP K-Means: constraints (must-link and cannot-link)
• Constrained K-Means and COP K-Means require all the constraints to be satisfied.
  – May not be effective if the seeds contain noise.
• Seeded K-Means uses the seeds only in the first step, to determine the initial centroids.
  – Less sensitive to noise in the seeds.
• Experiments show that semi-supervised K-Means outperforms traditional K-Means.
Reference
• Semi-supervised clustering by seeding
  http://www.cs.utexas.edu/users/ml/papers/semi-icml-02.pdf
• Constrained K-means clustering with background knowledge
  http://www.litech.org/~wkiri/Papers/wagstaff-kmeans-01.pdf
Next class
• Topics
  – Similarity-based semi-supervised clustering
• Readings
  – Distance metric learning, with application to clustering with side-information
    http://ai.stanford.edu/~ang/papers/nips02-metric.pdf
  – From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering
    http://www.cs.berkeley.edu/~klein/papers/constrained_clustering-ICML_2002.pdf