Semi-Supervised Clustering

CS 685: Special Topics in Data Mining, Spring 2008
Jinze Liu
The University of Kentucky


Page 1: Semi-Supervised Clustering

CS685 : Special Topics in Data Mining, UKY

The UNIVERSITY of KENTUCKY

Semi-Supervised Clustering

CS 685: Special Topics in Data Mining, Spring 2008

Jinze Liu

Page 2: Semi-Supervised Clustering

Outline of lecture

• Overview of clustering and classification

• What is semi-supervised learning?
  – Semi-supervised clustering
  – Semi-supervised classification

• Semi-supervised clustering
  – What is semi-supervised clustering?
  – Why semi-supervised clustering?
  – Semi-supervised clustering algorithms

Page 3: Semi-Supervised Clustering

Supervised classification versus unsupervised clustering

• Unsupervised clustering
  – Group similar objects together to find clusters
    • Minimize intra-cluster distance
    • Maximize inter-cluster distance

• Supervised classification
  – Class label for each training sample is given
  – Build a model from the training data
  – Predict class labels on unseen future data points
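The two clustering objectives above can be made concrete with a small sketch (pure Python; the function name and toy points are mine, not from the slides). It computes the average pairwise distance within clusters, which clustering tries to minimize, and across clusters, which it tries to maximize:

```python
from itertools import combinations
from math import dist  # Python 3.8+

def avg_intra_inter(clusters):
    """Average pairwise distance within clusters (intra-cluster) and
    across clusters (inter-cluster). A good clustering makes the first
    small and the second large. `clusters` is a list of point lists;
    each cluster is assumed to contain at least two points."""
    intra = [dist(p, q) for c in clusters for p, q in combinations(c, 2)]
    inter = [dist(p, q)
             for a, b in combinations(clusters, 2)
             for p in a for q in b]
    return sum(intra) / len(intra), sum(inter) / len(inter)
```

For two tight, well-separated groups such as `[[(0, 0), (0, 1)], [(5, 0), (5, 1)]]`, the intra-cluster average is far smaller than the inter-cluster average.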

Page 4: Semi-Supervised Clustering

What is clustering?

• Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from (or unrelated to) the objects in other groups

[Figure: two clusters of points; intra-cluster distances are minimized, inter-cluster distances are maximized]

Page 5: Semi-Supervised Clustering

What is Classification?

[Figure: classification workflow. A learning algorithm performs induction on the Training Set to learn a Model; applying the model to the Test Set is deduction.]

Training Set:

  Tid  Attrib1  Attrib2  Attrib3  Class
   1   Yes      Large    125K     No
   2   No       Medium   100K     No
   3   No       Small     70K     No
   4   Yes      Medium   120K     No
   5   No       Large     95K     Yes
   6   No       Medium    60K     No
   7   Yes      Large    220K     No
   8   No       Small     85K     Yes
   9   No       Medium    75K     No
  10   No       Small     90K     Yes

Test Set:

  Tid  Attrib1  Attrib2  Attrib3  Class
  11   No       Small     55K     ?
  12   Yes      Medium    80K     ?
  13   Yes      Large    110K     ?
  14   No       Small     95K     ?
  15   No       Large     67K     ?

Page 6: Semi-Supervised Clustering

Clustering algorithms

• K-Means

• Hierarchical clustering

• Graph based clustering (Spectral clustering)

Page 7: Semi-Supervised Clustering

Classification algorithms

• Decision Trees
• Naïve Bayes classifier
• Support Vector Machines (SVM)
• K-Nearest-Neighbor classifiers
• Logistic Regression
• Neural Networks
• Linear Discriminant Analysis (LDA)

Page 8: Semi-Supervised Clustering

Supervised Classification Example

[Figure: scatter plot of data points]


Page 11: Semi-Supervised Clustering

Unsupervised Clustering Example

[Figure: scatter plot of data points]


Page 13: Semi-Supervised Clustering

Semi-Supervised Learning

• Combines labeled and unlabeled data during training to improve performance:
  – Semi-supervised classification: training on labeled data exploits additional unlabeled data, frequently resulting in a more accurate classifier.
  – Semi-supervised clustering: uses a small amount of labeled data to aid and bias the clustering of unlabeled data.

[Figure: a spectrum running from unsupervised clustering through semi-supervised learning to supervised classification]

Page 14: Semi-Supervised Clustering

Semi-Supervised Classification Example

[Figure: scatter plot of data points]


Page 16: Semi-Supervised Clustering

Semi-Supervised Classification

• Algorithms:
  – Semi-supervised EM [Ghahramani:NIPS94, Nigam:ML00]
  – Co-training [Blum:COLT98]
  – Transductive SVMs [Vapnik:98, Joachims:ICML99]
  – Graph-based algorithms

• Assumptions:
  – Known, fixed set of categories given in the labeled data.
  – Goal is to improve classification of examples into these known categories.

• More discussion next week

Page 17: Semi-Supervised Clustering

Semi-Supervised Clustering Example

[Figure: scatter plot of data points]


Page 19: Semi-Supervised Clustering

Second Semi-Supervised Clustering Example

[Figure: scatter plot of data points]


Page 21: Semi-Supervised Clustering

Semi-supervised clustering: problem definition

• Input:
  – A set of unlabeled objects, each described by a set of attributes (numeric and/or categorical)
  – A small amount of domain knowledge

• Output:
  – A partitioning of the objects into k clusters (possibly with some discarded as outliers)

• Objective:
  – Maximum intra-cluster similarity
  – Minimum inter-cluster similarity
  – High consistency between the partitioning and the domain knowledge

Page 22: Semi-Supervised Clustering

Why semi-supervised clustering?

• Why not clustering?
  – The clusters produced may not be the ones required.
  – Sometimes there are multiple possible groupings.

• Why not classification?
  – Sometimes there is insufficient labeled data.

• Potential applications
  – Bioinformatics (gene and protein clustering)
  – Document hierarchy construction
  – News/email categorization
  – Image categorization

Page 23: Semi-Supervised Clustering

Semi-Supervised Clustering

• Domain knowledge
  – Partial label information is given
  – Apply some constraints (must-links and cannot-links)

• Approaches
  – Search-based semi-supervised clustering (this class)
    • Alter the clustering algorithm using the constraints
  – Similarity-based semi-supervised clustering (next class)
    • Alter the similarity measure based on the constraints
  – Combination of both

Page 24: Semi-Supervised Clustering

Search-Based Semi-Supervised Clustering

• Alter the clustering algorithm that searches for a good partitioning by:
  – Modifying the objective function to give a reward for obeying labels on the supervised data [Demeriz:ANNIE99].
  – Enforcing constraints (must-link, cannot-link) on the labeled data during clustering [Wagstaff:ICML00, Wagstaff:ICML01].
  – Using the labeled data to initialize clusters in an iterative refinement algorithm (K-Means) [Basu:ICML02].

Page 25: Semi-Supervised Clustering

Overview of K-Means Clustering

• K-Means is a partitional clustering algorithm based on iterative relocation that partitions a dataset into K clusters.

Algorithm: Initialize the K cluster centers {μ_l}, l = 1…K, randomly. Repeat until convergence:
  – Cluster assignment step: assign each data point x to the cluster X_l whose center μ_l has minimum L2 distance from x.
  – Center re-estimation step: re-estimate each cluster center μ_l as the mean of the points in that cluster.
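The two alternating steps can be sketched as follows (a minimal pure-Python illustration for 2-D points; the function signature and the convergence test by unchanged assignments are my choices, not from the slides):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Plain K-Means on 2-D points: random initialization, then alternate
    the cluster assignment and center re-estimation steps until the
    assignments stop changing. Returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initialization from the data
    labels = None
    for _ in range(iters):
        # Cluster assignment step: nearest center by squared L2 distance
        new_labels = [
            min(range(k),
                key=lambda l: (p[0] - centers[l][0]) ** 2
                            + (p[1] - centers[l][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break  # converged: no assignment changed
        labels = new_labels
        # Center re-estimation step: mean of the points in each cluster
        for l in range(k):
            members = [p for p, lab in zip(points, labels) if lab == l]
            if members:
                centers[l] = (sum(p[0] for p in members) / len(members),
                              sum(p[1] for p in members) / len(members))
    return centers, labels
```

On two well-separated groups of points, a few iterations suffice to recover the groups regardless of which data points are drawn as initial centers.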

Page 26: Semi-Supervised Clustering

K-Means Objective Function

• Locally minimizes the sum of squared distances between the data points and their corresponding cluster centers:

  Σ_{l=1}^{K} Σ_{x_i ∈ X_l} ||x_i − μ_l||²

• Initialization of the K cluster centers:
  – Totally random
  – Random perturbation from the global mean
  – Heuristic to ensure well-separated centers
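For illustration, the objective can be evaluated directly from a partitioning (a small sketch; the function name is mine, not from the slides):

```python
def kmeans_sse(points, centers, labels):
    """K-Means objective: the sum over clusters l and points x_i in X_l of
    ||x_i - mu_l||^2, computed in one pass over (point, label) pairs.
    `labels[i]` indexes the center assigned to `points[i]`."""
    return sum(
        sum((xc - cc) ** 2 for xc, cc in zip(p, centers[l]))
        for p, l in zip(points, labels)
    )
```

Each K-Means iteration can only decrease this quantity, which is why the algorithm converges, though possibly to a local minimum.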

Page 27: Semi-Supervised Clustering

K Means Example

Page 28: Semi-Supervised Clustering

K Means Example: Randomly Initialize Means

[Figure: data points with two randomly placed cluster centers marked ×]

Page 29: Semi-Supervised Clustering

K Means Example: Assign Points to Clusters

[Figure: each point assigned to the nearer of the two centers]

Page 30: Semi-Supervised Clustering

K Means Example: Re-estimate Means

[Figure: the two centers moved to the means of their assigned points]

Page 31: Semi-Supervised Clustering

K Means Example: Re-assign Points to Clusters

[Figure: points re-assigned to the nearer updated center]

Page 32: Semi-Supervised Clustering

K Means Example: Re-estimate Means

[Figure: centers updated again]

Page 33: Semi-Supervised Clustering

K Means Example: Re-assign Points to Clusters

[Figure: assignments after another pass]

Page 34: Semi-Supervised Clustering

K Means Example: Re-estimate Means and Converge

[Figure: final centers; assignments no longer change]

Page 35: Semi-Supervised Clustering

Semi-Supervised K-Means

• Partial label information is given
  – Seeded K-Means
  – Constrained K-Means

• Constraints (must-link, cannot-link)
  – COP K-Means

Page 36: Semi-Supervised Clustering

Semi-Supervised K-Means for partially labeled data

• Seeded K-Means:
  – Labeled data provided by the user are used for initialization: the initial center for cluster i is the mean of the seed points having label i.
  – Seed points are used only for initialization, not in subsequent steps.

• Constrained K-Means:
  – Labeled data provided by the user are used to initialize the K-Means algorithm.
  – Cluster labels of seed data are kept unchanged in the cluster assignment steps; only the labels of the non-seed data are re-estimated.
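A minimal sketch of the shared seed-based initialization and the two variants' assignment steps (pure Python; the function names and the `seed_label` convention are my own, not from the slides):

```python
def seed_centers(seeds):
    """Seeded/Constrained K-Means initialization: the initial center for
    cluster i is the mean of the seed points labeled i.
    `seeds` maps each cluster label to a list of 2-D points."""
    return {
        label: (sum(p[0] for p in pts) / len(pts),
                sum(p[1] for p in pts) / len(pts))
        for label, pts in seeds.items()
    }

def assign(point, centers, seed_label=None):
    """One cluster-assignment decision. In Seeded K-Means every point,
    seeds included, goes to the nearest center; in Constrained K-Means a
    seed point keeps its given label, so pass `seed_label` for seeds."""
    if seed_label is not None:
        return seed_label  # Constrained K-Means pins the seed's label
    return min(centers,
               key=lambda l: (point[0] - centers[l][0]) ** 2
                           + (point[1] - centers[l][1]) ** 2)
```

Both variants then alternate assignment and re-estimation exactly as in plain K-Means; they differ only in whether the seeds' labels may change.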

Page 37: Semi-Supervised Clustering

Seeded K-Means

Use labeled data to find the initial centroids and then run K-Means.

The labels for seeded points may change.

Page 38: Semi-Supervised Clustering

Seeded K-Means Example

Page 39: Semi-Supervised Clustering

Seeded K-Means Example: Initialize Means Using Labeled Data

[Figure: two centers placed at the means of the labeled seed points]

Page 40: Semi-Supervised Clustering

Seeded K-Means Example: Assign Points to Clusters

[Figure: each point assigned to the nearer center]

Page 41: Semi-Supervised Clustering

Seeded K-Means Example: Re-estimate Means

[Figure: centers moved to the means of their clusters]

Page 42: Semi-Supervised Clustering

Seeded K-Means Example: Assign Points to Clusters and Converge

[Figure: final clustering; the label of one seed point has changed]

Page 43: Semi-Supervised Clustering

Constrained K-Means

Use labeled data to find the initial centroids and then run K-Means.

The labels for seeded points will not change.

Page 44: Semi-Supervised Clustering

Constrained K-Means Example

Page 45: Semi-Supervised Clustering

Constrained K-Means Example: Initialize Means Using Labeled Data

[Figure: two centers placed at the means of the labeled seed points]

Page 46: Semi-Supervised Clustering

Constrained K-Means Example: Assign Points to Clusters

[Figure: seed points keep their labels; the other points go to the nearer center]

Page 47: Semi-Supervised Clustering

Constrained K-Means Example: Re-estimate Means and Converge

[Figure: final centers; seed labels unchanged]

Page 48: Semi-Supervised Clustering

COP K-Means

• COP K-Means [Wagstaff et al.: ICML01] is K-Means with must-link (must be in same cluster) and cannot-link (cannot be in same cluster) constraints on data points.

• Initialization: Cluster centers are chosen randomly, but as each one is chosen any must-link constraints that it participates in are enforced (so that they cannot later be chosen as the center of another cluster).

• Algorithm: During cluster assignment step in COP-K-Means, a point is assigned to its nearest cluster without violating any of its constraints. If no such assignment exists, abort.
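The constrained assignment step can be sketched as follows (a simplified illustration, not Wagstaff et al.'s exact pseudocode; the function name and the pair-list representation of constraints are mine):

```python
def cop_assign(i, points, labels, centers, must, cannot):
    """COP K-Means cluster assignment for point i: try clusters in order
    of increasing distance to their centers and return the first one that
    violates no constraint; return None if every cluster is ruled out
    (the algorithm then aborts). `must` and `cannot` are lists of index
    pairs; `labels` holds current assignments (None = not yet assigned)."""
    p = points[i]
    by_dist = sorted(centers,
                     key=lambda l: (p[0] - centers[l][0]) ** 2
                                 + (p[1] - centers[l][1]) ** 2)
    for l in by_dist:
        ok = True
        for a, b in must:      # must-link partner must end up in the same cluster
            if i in (a, b):
                j = b if a == i else a
                if labels[j] is not None and labels[j] != l:
                    ok = False
        for a, b in cannot:    # cannot-link partner must not be in this cluster
            if i in (a, b):
                j = b if a == i else a
                if labels[j] == l:
                    ok = False
        if ok:
            return l
    return None
```

When a point's must-link and cannot-link constraints rule out every cluster, `cop_assign` returns `None`, corresponding to the failure case illustrated on the following slides.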

Page 49: Semi-Supervised Clustering

COP K-Means Algorithm

[Figure: COP K-Means pseudocode, not recovered]

Page 50: Semi-Supervised Clustering

Illustration

[Figure: a new point with a must-link to a point in the red class; the constraint determines its label, so it is assigned to the red class]

Page 51: Semi-Supervised Clustering

Illustration

[Figure: a new point with a cannot-link constraint; the constraint determines its label, so it is assigned to the red class]

Page 52: Semi-Supervised Clustering

Illustration

[Figure: a new point with a cannot-link and a must-link that cannot both be satisfied; the clustering algorithm fails]

Page 53: Semi-Supervised Clustering

Summary

• Seeded and Constrained K-Means: partially labeled data
• COP K-Means: constraints (must-link and cannot-link)

• Constrained K-Means and COP K-Means require all the constraints to be satisfied.
  – May not be effective if the seeds contain noise.

• Seeded K-Means uses the seeds only in the first step, to determine the initial centroids.
  – Less sensitive to noise in the seeds.

• Experiments show that semi-supervised K-Means outperforms traditional K-Means.

Page 54: Semi-Supervised Clustering

Reference

• Semi-supervised Clustering by Seeding
  – http://www.cs.utexas.edu/users/ml/papers/semi-icml-02.pdf

• Constrained K-means Clustering with Background Knowledge
  – http://www.litech.org/~wkiri/Papers/wagstaff-kmeans-01.pdf

Page 55: Semi-Supervised Clustering

Next class

• Topics
  – Similarity-based semi-supervised clustering

• Readings
  – Distance metric learning, with application to clustering with side-information
    • http://ai.stanford.edu/~ang/papers/nips02-metric.pdf
  – From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering
    • http://www.cs.berkeley.edu/~klein/papers/constrained_clustering-ICML_2002.pdf