Efficient and Numerically Stable Sparse Learning
Sihong Xie1, Wei Fan2, Olivier Verscheure2, and Jiangtao Ren3
1 University of Illinois at Chicago, USA
2 IBM T.J. Watson Research Center, New York, USA
3 Sun Yat-Sen University, Guangzhou, China
Applications
• Signal processing (compressive sensing, MRI, coding, etc.)
• Computational Biology (DNA array sensing, gene expression pattern annotation)
• Geophysical Data Analysis
• Machine learning
Algorithms
• Greedy selection
– Via L0 regularization
– Boosting, forward feature selection: not for large-scale problems
• Convex optimization
– Via L1 regularization (e.g., Lasso)
– IPM (interior point method): medium-size problems
– Homotopy method: full regularization path computation
– Gradient descent
– Online algorithms (Stochastic Gradient Descent)
Rising awareness of Numerical Problems in ML
• Efficiency
– SVM: going beyond black-box optimization solvers
– Large-scale problems: parallelization
– Eigenvalue problems: randomization
• Stability
– Gaussian process computation: solving large systems of linear equations, matrix inversion
– Convergence of gradient descent: matrix iteration computation
• For more topics in numerical mathematics for ML, see the ICML Workshop on Numerical Methods in Machine Learning 2009
Stability in Sparse learning
• Iterative Hard Thresholding (IHT)
– Solves the following optimization problem (see the sketch after this list)
– Incorporates gradient descent with hard thresholding
– Simple and scalable
– Under the RIP assumption, previous work [BDIHT09, GK09] shows that iterative hard thresholding converges.
– Without an assumption on the spectral radius of the iteration matrix, such methods may diverge.
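The optimization problem and update referenced above can be written, in a standard IHT formulation (a hedged sketch; the exact GraDes variant in [GK09] may differ in step size and thresholding details), as min_w ||y − Xw||^2 subject to ||w||_0 ≤ s, iterated as a gradient step followed by hard thresholding:

    import numpy as np

    def iht(X, y, s, eta=1.0, iters=100):
        # Sketch of Iterative Hard Thresholding: gradient step on 0.5 * ||y - X w||^2,
        # then keep only the s largest-magnitude coordinates (hard thresholding).
        w = np.zeros(X.shape[1])
        for _ in range(iters):
            w = w + eta * X.T @ (y - X @ w)      # gradient descent step
            small = np.argsort(np.abs(w))[:-s]   # indices of all but the s largest entries
            w[small] = 0.0                       # hard threshold
        return w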
Stability in Sparse learning
• Gradient descent viewed as a matrix iteration
• Error vector of the iteration
• Error vector of IHT
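A hedged reconstruction of the equations behind these bullets, following the standard least-squares analysis (w^* denotes the target solution):

    w^{t+1} = w^t + \eta X^\top (y - X w^t) = (I - \eta X^\top X) w^t + \eta X^\top y
    e^{t+1} := w^{t+1} - w^* = (I - \eta X^\top X) e^t        (gradient step, before hard thresholding)

The error therefore contracts only if the spectral radius of the iteration matrix I - \eta X^\top X (restricted to the sparse supports IHT visits) is below 1; without that assumption the iteration may diverge, as noted above.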
Stability in Sparse learning
• Mirror Descent Algorithm for Sparse Learning (SMIDAS)
– Maintains a dual vector and a primal vector
– Step 1: Recover the predictor (primal vector) from the dual vector
– Step 2: Gradient descent and soft-thresholding
Stability in Sparse learning
• Elements of the primal vector are exponentially sensitive to the corresponding elements of the dual vector (d is the dimensionality of the data)
• Due to limited precision, small components that are needed in prediction will be omitted when computing the primal vector (see the sketch below)
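To see the sensitivity concretely, here is a sketch of the p-norm link from the dual vector to the primal vector used by SMIDAS-style mirror descent (following the p-norm algorithm in [ST2009]/[GL1999]; the paper's exact normalization may differ). With p = O(ln d), raising dual components to the power p − 1 pushes small components below double-precision resolution:

    import numpy as np

    def link(theta, p):
        # p-norm link: w_j = sign(theta_j) * |theta_j|^(p-1) / ||theta||_p^(p-2)
        norm_p = np.sum(np.abs(theta) ** p) ** (1.0 / p)
        return np.sign(theta) * np.abs(theta) ** (p - 1) / norm_p ** (p - 2)

    d = 10 ** 7
    p = 2 * np.log(d)                    # p ~ 32 for this d, close to the p = 33 used later
    theta = np.array([1.0, 0.1, 0.01])   # dual components two orders of magnitude apart
    w = link(theta, p)
    print(w / w[0])                      # ratios ~1e-31 and ~1e-62, far below 2^-52 of the largest,
                                         # so these features contribute nothing to a double-precision dot product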
Stability in Sparse learning
• Example
– Suppose the data are …
Efficiency of Sparse Learning
• Sparse models
– Lower computational cost
– Lower generalization bound
• Existing sparse learning algorithms may not trade off well between sparsity and accuracy
– Overly complicated models are produced, with lower accuracy
– Can we get accurate models with higher sparsity?
• For a theoretical treatment of the trade-off between accuracy and sparsity, see S. Shalev-Shwartz, N. Srebro, and T. Zhang. Trading accuracy for sparsity. Technical report, TTIC, May 2009.
The proposed method
Perceptron + soft-thresholding
• Motivation
– Soft-thresholding: L1 regularization for a sparse model
– Perceptron:
1. Avoids updates when the current features already predict well
2. Converges under soft-thresholding and limited precision (Lemma 2 and Theorem 1)
3. Compression (Theorem 2)
4. Generalization error bound (Theorem 3)
• Don't complicate the model when unnecessary (a rough sketch of this update style follows)
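A minimal sketch of this update style, assuming a binary classification setting with labels in {−1, +1} (illustrative only; the paper's actual algorithm, step sizes, and thresholding schedule may differ):

    import numpy as np

    def perceptron_soft_threshold(X, y, lam=0.01, eta=1.0, epochs=5):
        # Update only on mistakes (don't complicate the model when the current
        # features already predict well), then soft-threshold for sparsity (L1).
        w = np.zeros(X.shape[1])
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * (xi @ w) <= 0:                                 # perceptron mistake
                    w = w + eta * yi * xi                              # perceptron update
                    w = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)  # soft-thresholding
        return w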
Experiments
• Datasets
– Large Scale Contest: http://largescale.first.fraunhofer.de/instructions/
Experiments: Divergence of IHT
• For IHT to converge, the spectral radius of the iteration matrix must be below 1 (a way to inspect this is sketched after this list)
• The iteration matrices found in practice do not meet this condition
• For IHT (GraDes) with the learning rate set to 1/3 and 1/100, respectively, we found …
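An illustrative way to inspect this condition on a given data matrix (not part of the paper's experimental code):

    import numpy as np

    def spectral_radius_iht(X, eta):
        # Spectral radius of the iteration matrix I - eta * X^T X;
        # IHT-style gradient iterations can diverge when this exceeds 1.
        sigma = np.linalg.svd(X, compute_uv=False)    # singular values of X
        rho = np.max(np.abs(1.0 - eta * sigma ** 2))
        if X.shape[1] > X.shape[0]:                   # rank-deficient X^T X also has eigenvalue 1
            rho = max(rho, 1.0)
        return rho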
Experiments: Numerical problem of MDA
• Train models with 40% density.
• The parameter p is set to 2 ln(d) (p = 33) and 0.5 ln(d), respectively.
• Percentage of model elements within [e_m, e_m − 52], indicating how many features will be lost during prediction (one way to compute this is sketched after this list)
• Dynamic range indicates how widely the elements of the model can vary.
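One way to compute these two diagnostics, under the assumed reading that e_m is the binary exponent of the largest weight and that weights more than 52 binary orders of magnitude smaller are effectively dropped in double-precision prediction:

    import numpy as np

    def precision_diagnostics(w):
        # Assumed interpretation of the [e_m, e_m - 52] range: weights whose magnitude is
        # within 2^52 of the largest weight still contribute to a double-precision dot
        # product; anything smaller is effectively lost during prediction.
        a = np.abs(w[w != 0])
        within = np.mean(a >= a.max() * 2.0 ** -52)   # fraction inside [e_m, e_m - 52]
        lost = 1.0 - within                           # fraction effectively lost
        dynamic_range = a.max() / a.min()             # how widely the weights vary
        return within, lost, dynamic_range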
Experiments: Numerical problem of MDA
• How the parameter p = O(ln(d)) affects performance
– With smaller p, the algorithm behaves more like ordinary stochastic gradient descent [GL1999]
– Larger p causes truncation during prediction
– When the dimensionality is high, MDA becomes numerically unstable.
[GL1999] Claudio Gentile and Nick Littlestone. The robustness of the p-norm algorithms. In Proceedings of the 12th Annual Conference on Computational Learning Theory, pages 1–11. ACM Press, New York, NY, 1999.
Experiments: Overall comparison
• The proposed algorithm is compared with 3 baseline sparse learning algorithms (all with the logistic loss function)
– SMIDAS (MDA-based [ST2009])
– TG (Truncated Gradient [LLZ2009])
– SCD (Stochastic Coordinate Descent [ST2009])
• Parameter tuning
– Run each algorithm 10 times and report the average accuracy on the validation set.
[ST2009] Shai Shalev-Shwartz and Ambuj Tewari. Stochastic methods for l1 regularized loss minimization. In Proceedings of the 26th International Conference on Machine Learning, pages 929–936, 2009.
[LLZ2009] John Langford, Lihong Li, and Tong Zhang. Sparse online learning via truncated gradient. Journal of Machine Learning Research, 10:777–801, 2009.
Experiments: Overall comparison
• Accuracy under the same model density
– First 7 datasets: select at most 40% of the features
– Webspam: select at most 0.1% of the features
– Stop running the program when the maximum percentage of features has been selected
Experiments: Overall comparison
• Accuracy vs. sparsity
– The proposed algorithm consistently works better than the other baselines.
– On 3 out of 5 tasks, it stopped updating the model before reaching the maximum density (40% of features)
– On task 1, it outperforms the others with 10% fewer features
– On task 3, it ties with the best baseline while using 20% fewer features
– On tasks 1-7, SMIDAS: the smaller p, the better the accuracy, but it is beaten by all the other algorithms
(Annotations from the comparison figures: numerically unstable, sparse, generalizability, convergence)
Conclusion