Feature Selection in k-Median Clustering
Olvi Mangasarian and Edward Wild
University of Wisconsin - Madison
Principal Objective
Find a reduced number of input space features such that clustering in the reduced space closely replicates the clustering in the full dimensional space
Basic Idea
Based on rigorous optimization theory, make a simple but fundamental modification in one of the two steps of the k-median algorithm
In each cluster, find a point closest in the 1-norm to all points in that cluster and to the median of ALL data points
Proposed approach can lead to a feature reduction as high as 64%, with clustering that agrees to within 4% with the clustering obtained using the original set of features
As the weight given to the data median increases, more features are deleted from the problem
FSKM Example
Start with median at origin
Apply k-median algorithm
As weight of data median increases, features are removed from the problem
Outline of Talk
Ordinary k-median algorithm
Two steps of the algorithm
Feature Selecting k-Median (FSKM) Algorithm
Overall optimization objective
Basic idea
Mathematical optimization formulation
Algorithm statement
Numerical examples
Conclusion & outlook
Ordinary k-Median Algorithm
Given m data points in n-dimensional input feature space
Find k cluster centers with the following property:
The sum of the 1-norm distances between each data point and the closest cluster center is minimized
Minimizing a sum of minima of linear functions is a concave minimization problem and is NP-hard
However, the two-step k-median algorithm terminates in a finite number of steps at a point satisfying the minimum principle necessary optimality condition
Two-Step k-Median Algorithm
(0) Start with k initial cluster centers
(1) Assign each data point to a 1-norm closest cluster center
(2) For each cluster compute a new cluster center that is 1-norm closest to all points in the cluster (median of cluster)
(3) Stop if all cluster centers are unchanged else go to (1)
Algorithm terminates in a finite number of steps at a point satisfying the minimum principle necessary optimality conditions
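The three steps above can be sketched in a few lines of NumPy; this is a minimal illustration, not the authors' code, and it assumes the initial centers are supplied explicitly:

```python
import numpy as np

def k_median(X, centers, max_iter=100):
    """Two-step k-median sketch.
    Step (1): assign each point to the 1-norm closest center.
    Step (2): recompute each center as the coordinatewise median of its
    cluster, which minimizes the sum of 1-norm distances to the cluster.
    Step (3): stop when no center changes."""
    centers = centers.copy()
    for _ in range(max_iter):
        # Step (1): 1-norm distance from every point to every center
        dist = np.abs(X[:, None, :] - centers[None, :, :]).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Step (2): coordinatewise median of each nonempty cluster
        new_centers = np.array([
            np.median(X[labels == j], axis=0) if np.any(labels == j)
            else centers[j]
            for j in range(len(centers))])
        if np.allclose(new_centers, centers):  # Step (3): unchanged, stop
            break
        centers = new_centers
    return centers, labels
```

As the slide states, this only guarantees finite termination at a point satisfying the minimum-principle necessary condition, not a global minimum, so the result depends on the initial centers.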
Key Change in Step (2) of k-Median Algorithm
(2) For each cluster compute a new cluster center that minimizes the sum of 1-norm distances to all points in the cluster plus a weighted 1-norm distance to the median of all data points
Weight of 1-norm distance to dataset median determines number of features deleted:
For a zero weight no features are suppressed
For a sufficiently large weight all features are suppressed
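The modified step (2) separates coordinatewise: in each coordinate, the minimizer of the sum of absolute deviations from the cluster points plus a weighted absolute deviation from the dataset median is a weighted median, with the dataset median acting as one extra point of weight lam. A minimal sketch (my illustration, not the authors' code; the names `fskm_center` and `lam` are assumptions):

```python
import numpy as np

def weighted_median(values, weights):
    """Return a minimizer of sum_i weights[i] * |values[i] - c|."""
    order = np.argsort(values)
    v, w = values[order], weights[order]
    cum = np.cumsum(w)
    # first sorted value where cumulative weight reaches half the total
    return v[np.searchsorted(cum, 0.5 * w.sum())]

def fskm_center(cluster_pts, data_median, lam):
    """Modified step (2): per coordinate, minimize the sum of distances
    to the cluster points plus lam times the distance to the dataset
    median, by treating the median as an extra point of weight lam."""
    m, n = cluster_pts.shape
    center = np.empty(n)
    for j in range(n):
        vals = np.append(cluster_pts[:, j], data_median[j])
        wts = np.append(np.ones(m), lam)
        center[j] = weighted_median(vals, wts)
    return center
```

With lam = 0 this reduces to the ordinary cluster median; once lam is large enough in a coordinate, the center snaps to the dataset median there, which is how features get suppressed.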
FSKM Theory
Subgradients
A subgradient f′(x) of a convex function f satisfies
f(y) − f(x) ≥ f′(x)(y − x) for all x, y ∈ Rⁿ
Consider ‖x‖₁ = |x| for x ∈ R¹:
If x < 0, the subgradient of |x| is −1
If x > 0, the subgradient of |x| is 1
If x = 0, the subgradient of |x| is any value in [−1, 1]
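The subgradient inequality above can be checked numerically for f(x) = |x|; a small sketch, assuming nothing beyond the definition (the helper name is mine):

```python
def is_subgradient(g, x, test_points):
    """Check the subgradient inequality |y| - |x| >= g*(y - x)
    for f(x) = |x| at every supplied test point y."""
    return all(abs(y) - abs(x) >= g * (y - x) for y in test_points)

ys = [v / 10.0 for v in range(-50, 51)]  # grid of test points in [-5, 5]
# At x = 0, every g in [-1, 1] passes, while g = 1.5 fails (e.g. at y = 5).
# At x = 2, only g = 1 works, matching the derivative of |x| there.
```

It is exactly the interval of subgradients at x = 0 that FSKM exploits: the 1-norm is nondifferentiable there, so a whole range of weights keeps a center coordinate pinned at zero.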
FSKM Theory (Continued)
Zeroing Cluster Features (Based on Necessary and Sufficient Optimality Conditions for Nondifferentiable Convex Optimization)
That is, cj = 0 whenever
FSKM Algorithm
FSKM Example (Revisited)
Start with median at origin
Apply k-median algorithm
Compute the per-feature removal thresholds for each cluster:
Cluster 1: x = 1, y = 5
Cluster 2: x = 0, y = 4
max over clusters: x = 1, y = 5
For a weight of 1, feature x is removed from the problem
Numerical Testing
FSKM tested on five publicly available labeled datasets
Labels were used only to test effectiveness of FSKM
Data is first clustered using k-median, then FSKM is applied to delete one feature at a time
Without using data labels, “error” in FSKM clustering with reduced features is obtained by comparison with the “gold standard” clustering with the full set of features
FSKM clustering error curve obtained without labels is compared with classification error curve obtained using data labels
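The label-free clustering "error" can be computed as the fraction of points whose reduced-feature cluster disagrees with the gold-standard full-feature cluster, after matching cluster labels under the best permutation. One way to do this (my illustration; the paper's exact measure may differ):

```python
from itertools import permutations

def clustering_disagreement(gold, pred, k):
    """Smallest fraction of points on which two k-clusterings differ,
    minimized over all relabelings of the predicted clusters.
    Exhaustive over k! permutations, so only sensible for small k."""
    n = len(gold)
    best = n
    for perm in permutations(range(k)):
        mismatches = sum(1 for g, p in zip(gold, pred) if g != perm[p])
        best = min(best, mismatches)
    return best / n
```

The permutation matching is needed because cluster labels are arbitrary: two identical partitions may use different label numbers.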
3-Class Wine Dataset: 178 Points in 13-dimensional Space
Remarks
Curves close together
Largest increase in error as last few features are removed
Reduced 13 features to 4:
Clustering error < 4%
Classification error decreased by 0.56 percentage points
2-Class Votes Dataset: 435 Points in 16-dimensional Space
Remarks
Curves have similar shape
Largest increase in error as last few features are removed
Reduced 16 features to 3:
Clustering error < 10%
Classification error increased by 1.84 percentage points
2-Class WDBC Dataset (Wisconsin Diagnostic Breast Cancer): 569 Points in 30-dimensional Space
Remarks
Curves have similar shape for 14 and fewer features
Removing the first 3 features causes no change in either error curve
Reduced 30 features to 7:
Clustering error < 10%
Classification error increased by 3.69 percentage points
2-Class Star/Galaxy-Bright Dataset: 2462 Points in 14-dimensional Space
Remarks
Clustering error increases gradually as number of features is reduced
Some features may be obstructing classification
Reduced 14 features to 4:
Clustering error < 10%
Classification error decreased by 1.42 percentage points
2-Class Cleveland Heart Dataset: 297 Points in 13-dimensional Space
Remarks
Largest increase in both curves going from 13 to 9 features
Most features useful?
Reduced 13 features to 8:
Clustering error < 17%
Classification error increased by 7.74 percentage points
Conclusion
FSKM is a fast method for selecting relevant features while maintaining clusters similar to those in the original full dimensional space
Features selected by FSKM without labels may be useful for labeled data classification as well
FSKM eliminates the costly search for an appropriately reduced number of features for clustering in smaller dimensional spaces (e.g., picking the best 6 of 14 features for the Star/Galaxy-Bright dataset by exhaustive search would take 14-choose-6 = 3003 k-median runs, compared to the 9 k-median runs required by FSKM)
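The run counts quoted can be verified directly: exhaustive search evaluates every 6-of-14 feature subset, while deleting one feature at a time needs one run at each dimension from 14 down to 6. A quick check, assuming that one-run-per-deletion accounting:

```python
import math

# All 6-feature subsets of 14 features, one k-median run each
exhaustive_runs = math.comb(14, 6)
# FSKM: one k-median run at each dimension 14, 13, ..., 6
fskm_runs = 14 - 6 + 1
```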
Outlook
Feature & data selection for support vector machines
Sparse kernel approximation methods
Gene expression selection
Incorporation of prior knowledge into learning
Optimization-based clustering may be useful in other machine learning applications
Minimalist supervised & unsupervised learning
Select minimal knowledge for best model
Web Pages (Containing Paper & Talk)
www.cs.wisc.edu/~olvi
www.cs.wisc.edu/~wildt