Study of Sparse Classifier Design Algorithms
Sachin Nagargoje, 08449
Advisor : Prof. Shirish Shevade
20th June 2013
Outline
Introduction
Sparsity w.r.t. features
◦ Using a regularizer/penalty: traditional penalties, other penalties, SparseNet
Sparsity w.r.t. support vectors / basis points
◦ Various techniques
◦ SVM with L1 regularizer
◦ Greedy methods
Proposed Methods
Experimental Results
Conclusion / Future Work
Introduction
What is Sparsity?
Sparsity w.r.t. features in the model
◦ e.g., the number of non-zero coefficients of the model
Sparsity w.r.t. support vectors
[Figure: support vectors, x1, …, xd]
Vapnik 1992; Vapnik et al., 1995
Need for Sparsity?
• Faster prediction
• Decreases the complexity of the model
• In the case of sparsity w.r.t. features, to remove:
– Redundant features
– Irrelevant features
– Noisy features
• As the number of features increases:
– Data becomes sparse in high dimensions
– It becomes difficult to achieve low generalization error
Traditional ways to achieve Sparsity
• Filter
– Select features before the ML algorithm is run
– E.g., rank features and eliminate
• Wrapper
– Find the best subset of features using ML techniques
– E.g., forward selection, random selection
• Embedded
– Feature selection as part of the ML algorithm
– E.g., L1-regularized linear regression
Sparsity w.r.t. features
Using Regularizer/Penalty
Data: x = [x1, x2, …, xn]; labels: y = [y1, y2, …, yn]^T; model: w = [w1, w2, …, wp]
A type of embedded approach
E.g., in the case of linear least squares regression (objective sketched below), where R represents the regularizer, e.g., L0 or L1.
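A sketch of the regularized least squares objective this slide refers to, assuming the usual setup in which the x_i are the data points and lambda >= 0 controls the strength of the penalty:

\[ \min_{w} \; \frac{1}{2} \sum_{i=1}^{n} \bigl( y_i - w^{T} x_i \bigr)^2 \;+\; \lambda \, R(w) \]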
Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267, 1994.
Traditional regularizers
L0 penalty
L1 penalty
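Written out, the two standard penalties for a model w in R^p are:

\[ R_{L_0}(w) = \sum_{j=1}^{p} \mathbb{1}[\, w_j \neq 0 \,] \quad \text{(counts the non-zero coefficients)}, \qquad R_{L_1}(w) = \|w\|_1 = \sum_{j=1}^{p} |w_j| \]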
Traditional regularizers (contd.)
Example: consider a rainfall prediction problem with the two models shown in the table below (assume both models have the same training error).
Model 1: L0 penalty = 1 + 1 + 1 + 1 + 1 = 5; L1 penalty = |3| + |-5| + |8| + |-4| + |1| = 21
Model 2: L0 penalty = 1 + 0 + 1 + 1 + 0 = 3; L1 penalty = |-20| + |0| + |7| + |18| + |0| = 45
Because L1 both shrinks and selects, it prefers the denser Model 1 here (smaller L1 penalty), whereas L0 prefers the sparser Model 2.
Rainfall Prediction
Feature Model 1 Model 2
Temperature 3 -20
Outlook -5 0
Pressure 8 7
Wind -4 18
Humidity 1 0
Other regularizers
MC+
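The formula on this slide did not survive extraction; for reference, the MC+ (minimax concave) penalty used by SparseNet is usually written as

\[ P(t; \lambda, \gamma) \;=\; \lambda \int_{0}^{|t|} \Bigl( 1 - \frac{x}{\gamma \lambda} \Bigr)_{+} dx \;=\; \begin{cases} \lambda \bigl( |t| - \frac{t^2}{2\gamma\lambda} \bigr) & \text{if } |t| \le \gamma\lambda \\[4pt] \frac{\gamma \lambda^2}{2} & \text{if } |t| > \gamma\lambda \end{cases} \]

As gamma tends to infinity it approaches the L1 penalty (soft thresholding); as gamma tends to 1 it approaches hard thresholding (L0-like behaviour).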
SparseNet
Uses coordinate descent with a non-convex penalty.
Consider the least squares problem for a single-feature data matrix; it has a closed-form solution. Our goal is to minimize a penalized version of this objective (sketched below).
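A sketch of the equations this slide refers to, assuming x is a single (standardized) feature column as in the SparseNet paper:

\[ \min_{w} \; \tfrac{1}{2} \| y - x w \|_2^2 \quad \Rightarrow \quad \hat{w} = \frac{x^{T} y}{x^{T} x} \;\; (= x^{T} y \text{ when } x^{T} x = 1) \]

\[ \text{Goal:} \quad \min_{w} \; \tfrac{1}{2} \| y - x w \|_2^2 \;+\; \lambda \, P(|w|; \lambda, \gamma) \]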
Rahul Mazumder, Jerome Friedman, and Trevor Hastie. SparseNet: Coordinate descent with non-convex penalties, 2009.
SparseNet (cont.)
• Define a soft-threshold operator (sketched below)
• There are three cases: w > 0, w < 0, w = 0
• Convert the multiple-feature objective into a sequence of single-feature problems
• Apply coordinate descent
Rahul Mazumder, Jerome Friedman, and Trevor Hastie. SparseNet: Coordinate descent with non-convex penalties, 2009.
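A sketch of the soft-threshold operator referred to above, in the form used for the L1 penalty (SparseNet replaces it with the threshold operator induced by the non-convex penalty):

\[ S(\tilde{w}, \lambda) \;=\; \operatorname{sign}(\tilde{w}) \, (|\tilde{w}| - \lambda)_{+} \;=\; \begin{cases} \tilde{w} - \lambda & \text{if } \tilde{w} > \lambda \\ \tilde{w} + \lambda & \text{if } \tilde{w} < -\lambda \\ 0 & \text{if } |\tilde{w}| \le \lambda \end{cases} \]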
SparseNet (cont.)
Now extend the problem to a data matrix with multiple features.
The soft-threshold operator is then applied coordinate-wise, to each feature's partial residual in turn.
- Rahul Mazumder, Jerome Friedman, and Trevor Hastie. SparseNet: Coordinate descent with non-convex penalties, 2009.
- Jerome Friedman, Trevor Hastie, Holger Höfling, and Robert Tibshirani. Pathwise coordinate optimization. Annals of Applied Statistics, 2007.
SparseNet (Algorithm)
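The algorithm listing on this slide did not survive extraction. As a stand-in, below is a minimal sketch (Python/NumPy, with the columns of X assumed scaled to unit squared norm) of cyclic coordinate descent with the L1 soft-threshold update; SparseNet follows the same loop structure but substitutes the threshold operator induced by the MC+ penalty and traces out a path of solutions over (lambda, gamma) with warm starts. The function names here are illustrative, not from the original code.

import numpy as np

def soft_threshold(z, lam):
    # S(z, lam) = sign(z) * max(|z| - lam, 0)
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def coordinate_descent_lasso(X, y, lam, n_iters=100):
    # Minimizes (1/2)||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent.
    # Assumes each column of X satisfies x_j^T x_j = 1.
    n, p = X.shape
    w = np.zeros(p)
    residual = y - X @ w
    for _ in range(n_iters):
        for j in range(p):
            residual += X[:, j] * w[j]          # partial residual excluding feature j
            wj_ls = X[:, j] @ residual          # univariate least-squares coefficient
            w[j] = soft_threshold(wj_ls, lam)   # thresholded update
            residual -= X[:, j] * w[j]
    return w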
SparseNet with L1 Penalty
Using the L1 penalty.
Slice Localization dataset; A. Frank and A. Asuncion. UCI Machine Learning Repository, 2010.
SparseNet with MC+ Penalty
Using the MC+ penalty.
Slice Localization dataset; A. Frank and A. Asuncion. UCI Machine Learning Repository, 2010.
Sparsity w.r.t. support vectors / basis points
Sparsity w.r.t. support vectors
Kernel-based learning algorithms: f(x) is a linear combination of terms of the form k(x, xi) (sketched below).
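Written out (the standard kernel expansion), with the support vectors being the training points whose coefficients are non-zero:

\[ f(x) \;=\; \sum_{i=1}^{n} \alpha_i \, k(x, x_i) \;+\; b \]

Sparsity w.r.t. support vectors therefore means keeping only a few non-zero alpha_i.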
Various techniques
Support Vector Machine (SVM)
◦ SVM with L1 penalty
Greedy methods (wrapper):
◦ Kernel Matching Pursuit (KMP)
◦ Building SVMs with reduced classifier complexity (Keerthi et al.)
Proposed method:
◦ Preprocess the training points using filtering, then apply wrapper methods
• Setting: the data
• SVM optimization problem
• SVM with L1 penalty: solved using linear programming
• Settings used:
– Lambda: [1/100, 1/10, 1, 10, 100]; Sigma: [1/16, 1/4, 1, 4, 16]
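The slide's equations were lost in extraction. As a reference point only, one common linear-programming formulation of an L1-regularized kernel SVM (the exact formulation used in these experiments may differ) is

\[ \min_{\alpha, b, \xi} \;\; \lambda \sum_{i=1}^{n} |\alpha_i| \;+\; \sum_{i=1}^{n} \xi_i \quad \text{s.t.} \quad y_i \Bigl( \sum_{j=1}^{n} \alpha_j \, k(x_i, x_j) + b \Bigr) \ge 1 - \xi_i, \;\; \xi_i \ge 0, \]

which becomes a linear program after splitting each alpha_i into positive and negative parts. Sigma above presumably refers to the width of an RBF kernel.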
SVM with L1 regularizer
Decision Boundaries and Support Vectors
SVM with L1 regularizer
RBF Kernel on dummy data
Poly & RBF Kernel on Banana data
SVM with L1 regularizer
Our formulation gave better (sparser) results than the standard SVM.
Datasets
Greedy methods
Kernel Matching Pursuit
Inspired by the signal processing community
◦ Decomposes any signal into a linear expansion of waveforms selected from a dictionary of functions
The set of basis points is constructed in a greedy fashion
Removes the requirement that the kernel matrix be positive definite
Allows us to directly control sparsity (in terms of the number of support vectors)
Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Machine Learning, Sep 2002.
Kernel Matching Pursuit
Setup:
◦ D: a finite dictionary of functions
◦ l: number of training points
◦ n: number of support vectors chosen so far
◦ At the (n+1)-th step, a basis function and its weight are chosen to best reduce the residual error (see the sketch below)
◦ Predictor: a weighted sum of the chosen basis functions, whose indices are the indices of the support vectors
Pascal Vincent and Yoshua Bengio. Kernel matching pursuit. Machine Learning, Sep 2002.
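A sketch of the equations this slide refers to, following Vincent and Bengio (2002): after n steps the predictor is

\[ f_n(x) \;=\; \sum_{k=1}^{n} \beta_k \, K(x, x_{\gamma_k}), \]

where gamma_1, ..., gamma_n are the indices of the chosen support vectors, and at the (n+1)-th step the basic algorithm picks

\[ (\gamma_{n+1}, \beta_{n+1}) \;=\; \arg\min_{i, \, \beta} \; \sum_{m=1}^{l} \bigl( y_m - f_n(x_m) - \beta \, K(x_m, x_i) \bigr)^2 . \]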
Basis points versus Support Vectors
- Dataset: http://mldata.org/repository/data/viewslug/banana-ida/
- S. Sathiya Keerthi, et al. Building support vector machines with reduced classifier complexity. JMLR, 2006.
- Vladimir Vapnik, Steven E. Golowich, and Alex J. Smola. Support vector method for function approximation, regression estimation and signal processing. NIPS, 1996.
Basis Points / Support Vectors
Proposed methods
Two-step process:
Step 1: Choose a subset of the training set:
◦ Modified BIRCH clustering algorithm
◦ K-means clustering
◦ GMM clustering
Step 2: Apply a greedy algorithm (see the sketch below):
◦ Kernel Matching Pursuit (KMP)
◦ Building SVMs with reduced classifier complexity (Keerthi et al.)
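A minimal sketch of this two-step pipeline, assuming Python/NumPy with scikit-learn's KMeans as the step-1 clusterer and a basic matching-pursuit-style greedy selection over an RBF kernel for step 2. The function and parameter names (cluster_then_greedy, n_basis, sigma) are illustrative rather than from the original implementation, and modified BIRCH/GMM or Keerthi et al.'s method could be swapped in for the corresponding stage.

import numpy as np
from sklearn.cluster import KMeans

def rbf_kernel(X, Z, sigma=1.0):
    # RBF kernel matrix between the rows of X and the rows of Z
    d2 = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def cluster_then_greedy(X, y, n_clusters=50, n_basis=20, sigma=1.0):
    # Step 1: candidate basis points = cluster centroids
    candidates = KMeans(n_clusters=n_clusters, n_init=10).fit(X).cluster_centers_
    # Step 2: greedy (matching-pursuit style) selection under the squared loss
    G = rbf_kernel(X, candidates, sigma)        # each column is one candidate basis function
    residual = y.astype(float).copy()
    chosen, weights = [], []
    for _ in range(n_basis):
        norms = (G ** 2).sum(axis=0) + 1e-12
        betas = G.T @ residual / norms          # best single-column least-squares fit
        gains = betas ** 2 * norms              # decrease in squared error per candidate
        j = int(np.argmax(gains))
        chosen.append(j)
        weights.append(betas[j])
        residual -= betas[j] * G[:, j]
    return candidates[chosen], np.array(weights)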
Proposed methods
[Pipeline diagram: Training points → Modified BIRCH / K-means / GMM clustering → Basis points → KMP or Keerthi et al.'s method → Model]
- S. Sathiya Keerthi, Olivier Chapelle, and Dennis DeCoste. Building support vector machines with reduced classifier complexity. Journal of Machine Learning Research, 2006.
- Tapas Kanungo, David M. Mount, Nathan S. Netanyahu, Christine D. Piatko, Ruth Silverman, and Angela Y. Wu. An efficient k-means clustering algorithm: Analysis and implementation, 2002.
BIRCH basics
Balanced Iterative Reducing and Clustering using Hierarchies
Uses a single scan over the dataset, so it suits large datasets
Each cluster's CF vector is defined as (N, LS, SS): N = number of data points, LS = linear sum, SS = squared sum
Merging of two clusters (see the sketch below):
◦ CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
CF Tree
◦ Height-balanced tree
◦ Two factors:
 B (branching factor): each non-leaf node contains at most B entries [CFi, childi], i = 1..B, where CFi is the sub-cluster represented by childi; a leaf node contains at most L entries [CFi], i = 1..L
 T (threshold): limit on the radius/diameter of a leaf sub-cluster
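A minimal sketch of the CF bookkeeping described above (Python/NumPy; the class and method names are illustrative): a point or sub-cluster is absorbed by adding CF vectors, and the absorption is accepted only if the resulting radius stays within the threshold T.

import numpy as np

class CF:
    # Clustering Feature: (N, LS, SS) summary of a sub-cluster
    def __init__(self, point):
        p = np.asarray(point, dtype=float)
        self.N, self.LS, self.SS = 1, p.copy(), float(p @ p)

    def merge(self, other):
        # CF additivity: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
        self.N += other.N
        self.LS += other.LS
        self.SS += other.SS

    def radius(self):
        # root-mean-square distance of member points from the centroid
        centroid = self.LS / self.N
        return float(np.sqrt(max(self.SS / self.N - centroid @ centroid, 0.0)))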
BIRCH Example
[Figure: CF tree with a root node, leaf nodes LN1–LN3, and sub-clusters sc1–sc8; a new sub-cluster is to be inserted.]
- www.cs.uvm.edu/~xwu/kdd/Birch-09.ppt
- Tian Zhang, Raghu Ramakrishnan, and Miron Livny. BIRCH: an efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, SIGMOD '96.
BIRCH Example: insertion into the CF tree (B = 3, L = 3)
[Figure: the new sub-cluster is inserted; the branching factor of leaf node LN1 exceeds 3, so LN1 is split into LN1' and LN1''.]
BIRCH Example (cont.)
[Figure: the branching factor of a non-leaf node exceeds 3, so the root is split and the height of the CF tree increases by one.]
BIRCH Example (cont.)
[Figure: a new point arrives and is routed down the CF tree.]
BIRCH Example (cont.)
[Figure: the alien point falls inside a leaf-node sub-cluster, which is broken into parts; the branching factor of leaf node LN3 then exceeds 3, so LN3 should split.]
Clusters using modified BIRCH
Centroids
Experiments
Datasets Used
Modified BIRCH with KMP
Column groups (each reporting # Basis and Test Accuracy):
(1) Modified BIRCH followed by KMP's basic algorithm
(2) KMP with the basic algorithm
(3) Modified BIRCH followed by KMP's back-fitting algorithm
(4) KMP with the back-fitting algorithm
Datasets  #Basis(1) TestAcc(1)  #Basis(2) TestAcc(2)  #Basis(3) TestAcc(3)  #Basis(4) TestAcc(4)
banana 44.1 0.7896 40 0.8893 81.44 0.8746 80 0.8758
breast-cancer 28.5 0.7273 40 0.7169 81.7 0.7156 80 0.7052
diabetis 48.9 0.7610 47 0.7653 185.7 0.7473 187 0.7497
flare-solar 78.5 0.6053 133 0.6593 311.2 0.5923 266 0.6210
german 75.4 0.7553 140 0.7683 252.3 0.7230 280 0.7487
heart 28.9 0.7910 34 0.8300 61.5 0.8060 68 0.7990
image 267.6 0.9567 260 0.9468 275.7 0.9566 260 0.9275
ringnorm 128.4 0.8993 160 0.8529 133.5 0.9104 160 0.8562
splice 385.9 0.7847 400 0.8726 394.4 0.8106 400 0.7033
thyroid 33 0.9533 56 0.9547 44.9 0.9493 28 0.9453
titanic 5.8 0.7701 8 0.7755 48.3 0.7783 60 0.7727
twonorm 21.3 0.9680 40 0.9746 103.6 0.9597 80 0.9544
waveform 74.7 0.7810 80 0.8909 92.3 0.8667 80 0.8717
svmguide1 295 0.9693 309 0.9693 740 0.9658 618 0.9663
svmguide3 297 0.8780 249 0.7073
Our formulation gave decent results (the better values were highlighted in red in the original slide).
Multi-class Modified BIRCH with KMP
All multi-class datasets gave better results
Modified BIRCH with Keerthi et al.'s method
Column groups (each reporting # Basis and Test Accuracy):
(1) Modified BIRCH followed by Keerthi et al.'s method
(2) Keerthi et al.'s method alone
Datasets  #Basis(1) TestAcc(1)  #Basis(2) TestAcc(2)
banana 16.38 .8798 20 0.8879
breast-cancer 16.30 .7195 20 0.7221
diabetis 36.40 .7593 23 0.7757
flare-solar 63.00 .6063 67 0.6678
german 50.60 .7503 70 0.7660
heart 11.10 .8070 9 0.8330
image 55.00 .8671 65 0.9440
ringnorm 133.50 .7776 80 0.9829
splice 78.90 .7312 100 0.8421
thyroid 9.00 .9347 7 0.9467
titanic 10.30 .7635 8 0.7753
twonorm 20.70 .9675 20 0.9757
waveform 20.63 .8665 20 0.8962
gisette 275.00 .9190 300 0.9740
svmguide1 154.00 .9648 154 0.9678
svmguide3 72.00 .7317 62 0.7317
w1a 144.00 .9702 124 0.9766
w2a 131.00 .9706 174 0.9796
w3a 196.00 .9711 246 0.9808
w4a 274.00 .9702 368 0.9825
w5a 369.00 .9705 494 0.9730
w6a 571.00 .9713 859 0.9733
w7a 793.00 .9711 1235 0.9747
w8a 1,370.00 .9740 2487 0.9823
K-means and GMM with KMP
Column groups (each reporting Test Accuracy and # SVs):
(1) SVM
(2) Using K-means to find basis points
(3) Using GMM clustering to find basis points
Datasets  TestAcc(1) SVs(1)  TestAcc(2) SVs(2)  TestAcc(3) SVs(3)
banana 0.8881 186.76 0.8870 41 0.8851 41
breast-cancer 0.7425 131.64 0.7104 10.7 0.7104 40.4
diabetis 0.7532 275.45 0.7370 24.3 0.7397 24.3
german 0.7657 449.12 0.7413 141 0.7500 71
heart 0.8265 102.69 0.7870 9 0.8030 9
image 0.9202 441.6 0.9527 260.7 0.9385 260.7
ringnorm 0.9817 102.59 0.7633 21 0.7474 21
splice 0.8868 694.65 0.8288 51 0.8328 51
thyroid 0.9511 40.45 0.9227 8 0.9120 29
titanic 0.7726 72.32 0.7607 2.5 0.7157 8.5
twonorm 0.9727 126.52 0.9701 21 0.9698 21
waveform 0.9013 158.42 0.8878 21 0.8876 21
K-means and GMM with KMP gave sparser models but lower test-set accuracy (except for the entries highlighted in blue in the original slide).
Conclusion
Studied various sparse classifier design algorithms.
Better results were obtained using SVM with the L1 penalty.
Modified BIRCH with KMP:
◦ gave decent results on binary datasets
◦ gave good results on multi-class datasets
◦ saved kernel calculations (and time), requiring roughly 1/5th of the original time
Clustering is a simple (though time-consuming) way to choose basis points, but it is not very effective.
Future work:
◦ Explore greedy embedded sparse multi-class classification with different loss functions, e.g., logistic loss
◦ Explore such techniques for semi-supervised learning
Thank You.