Blitz: A Principled Meta-Algorithm for Scaling Sparse Optimization
Tyler B. Johnson and Carlos Guestrin, University of Washington



  • Slide 1
  • Blitz: A Principled Meta-Algorithm for Scaling Sparse Optimization. Tyler B. Johnson and Carlos Guestrin, University of Washington.
  • Slide 2
  • Optimization is very important to machine learning: it turns models of data into an optimal model. Our focus is constrained convex optimization, where the number of constraints can be very large!
  • Slide 3
  • Example: sparse regression and classification, where the dual problem has many constraints.
  • Slide 4
  • Choices for Scaling Optimization: stochastic methods, parallelization, and the subject of this talk: active sets.
  • Slide 5
  • Active Set Motivation: an important fact (illustrated on the slide).
  • Slide 6
  • Convex Optimization with Active Sets: a convex objective f over a feasible set (illustrated).
  • Slide 7
  • Convex Optimization with Active Sets: 1. Choose a set of constraints. 2. Set x to minimize the objective subject to the chosen constraints. Then repeat...
  • Slide 8
  • Convex Optimization with Active Sets: 1. Choose a set of constraints. 2. Set x to minimize the objective subject to the chosen constraints. The algorithm converges when x is feasible!
  • Slide 9
  • Limitations of Active Sets. The loop (see the sketch below): until x is feasible, propose an active set of important constraints, then set x to the minimizer of the objective subject to only the active set. Open questions: How many iterations to expect? (x is infeasible until convergence.) Which constraints are important? How many constraints to choose? When to terminate the subproblem?
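To make the loop concrete, here is a minimal sketch of a generic active set method; the helpers `select_active_set`, `minimize_subject_to`, and `is_feasible` are hypothetical placeholders for whatever heuristics an implementation supplies, not names from the talk.

```python
def active_set_method(constraints, select_active_set,
                      minimize_subject_to, is_feasible):
    """Generic active set loop (sketch; all helpers are assumed).

    Repeatedly minimizes the objective over a small subset of the
    constraints until the resulting point satisfies all of them.
    """
    active = set()
    while True:
        # Heuristically pick constraints believed to bind at the solution.
        active = select_active_set(constraints, active)
        # Solve the cheaper subproblem over only the active constraints.
        x = minimize_subject_to(active)
        # x may violate ignored constraints; stop only once fully feasible.
        if is_feasible(x, constraints):
            return x
```

The open questions on the slide are exactly the degrees of freedom left unspecified here: how `select_active_set` scores constraints, how many it returns, and how accurately `minimize_subject_to` solves each subproblem.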
  • Slide 10
  • Blitz. Start with a feasible point y and set x to the minimizer subject to no constraints. 1. Update y to be the extreme feasible point on the segment [y, x].
  • Slide 11
  • Blitz. 2. Select the top k constraints with boundaries closest to y.
  • Slide 12
  • Blitz. 3. Set x to minimize the objective subject to the selected constraints. And repeat:
  • Slide 13
  • Blitz. 1. Update y to be the extreme feasible point on the segment [y, x].
  • Slide 14
  • Blitz. 2. Choose the top k constraints with boundaries closest to y. 3. Set x to minimize the objective subject to the selected constraints. When x = y, Blitz converges! (The full loop is sketched below.)
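Putting the three steps together, here is a minimal sketch of the Blitz loop for linear constraints A @ z <= b; the subproblem solver `minimize_subject_to`, the warm starts, and the tolerance are assumptions of this sketch, not details from the talk.

```python
import numpy as np

def blitz(minimize_subject_to, A, b, x0, y0, k, tol=1e-8):
    """Sketch of the Blitz meta-algorithm for constraints A @ z <= b.

    y always stays feasible; x minimizes the objective over the current
    active set (x0 should be the unconstrained minimizer, y0 feasible).
    """
    x, y = x0, y0
    while np.linalg.norm(x - y) > tol:        # converged when x meets y
        # 1. Move y to the extreme feasible point on the segment [y, x].
        d = x - y
        Ad = A @ d
        slack = b - A @ y                     # >= 0 since y is feasible
        blocking = Ad > 0                     # constraints tightening along d
        alpha = 1.0
        if blocking.any():
            alpha = min(1.0, (slack[blocking] / Ad[blocking]).min())
        y = y + alpha * d
        # 2. Select the k constraints whose boundaries are closest to y.
        dist = (b - A @ y) / np.linalg.norm(A, axis=1)
        active = np.argsort(dist)[:k]
        # 3. Re-minimize the objective subject to only the active set.
        x = minimize_subject_to(A[active], b[active], x)
    return x
```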
  • Slide 15
  • Blitz Intuition: the key to Blitz is its y-update.
  • Slide 16
  • Blitz Intuition: the key to Blitz is its y-update. If the y update is large, Blitz is near convergence. If the y update is small...
  • Slide 17
  • Blitz Intuition: if the y update is large, Blitz is near convergence; if the y update is small, then a violated constraint greatly improves x at the next iteration, so x must improve significantly.
  • Slide 18
  • Main Theorem (Theorem 2.1).
  • Slide 19
  • Active Set Size for Linear Convergence (Corollary 2.2).
  • Slide 20
  • Constraint Screening (Corollary 2.3).
  • Slide 21
  • Tuning Algorithmic Parameters. Theory guides the choice of: active set size, and subproblem termination criteria. (Plots compare the best fixed settings against settings tuned using theory.)
  • Slide 22
  • Recap. Blitz is an active set algorithm that: selects theoretically justified active sets to maximize guaranteed progress; applies theoretical analysis to guide the choice of algorithm parameters; and discards constraints proven to be irrelevant during optimization.
  • Slide 23
  • Empirical Evaluation
  • Slide 24
  • Experiment Overview: apply Blitz to L1-regularized loss minimization. The dual is a constrained problem, and optimizing subject to an active set corresponds to solving the primal problem over a subset of variables (the correspondence is sketched below).
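As a concrete instance of that correspondence (the squared loss case; the notation here is assumed, not taken from the slides), consider the lasso and its dual: each dual constraint is generated by one feature a_j, so an active set of dual constraints is exactly a subset of primal variables.

```latex
% Lasso primal over weights w, with design matrix A and labels b:
\min_{w} \; \tfrac{1}{2}\|Aw - b\|_2^2 + \lambda \|w\|_1
% The dual is a constrained problem with one constraint per feature a_j:
\max_{x} \; b^\top x - \tfrac{1}{2}\|x\|_2^2
\quad \text{subject to} \quad |a_j^\top x| \le \lambda \;\; \text{for all } j
```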
  • Slide 25
  • Single Machine, Data in Memory: experiment with the high-dimensional RCV1 dataset. [Plot: relative suboptimality vs. time (s); legend: ProxNewt, CD, L1_LR, LIBLINEAR, GLMNET, Blitz; groups labeled "No Prioritization" and "Active Sets".]
  • Slide 26
  • Limited Memory Setting. Data cannot always fit in memory, but active set methods require only a subset of the data at each iteration to solve the subproblem. Set-up: make one pass over the data to load the active set, solve the subproblem with the active set in memory, and repeat (sketched below).
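A minimal sketch of that out-of-core loop; `select_active_set`, `load_columns`, `solve_subproblem`, and `is_converged` are hypothetical helpers standing in for whatever streaming I/O and solver an implementation uses.

```python
def limited_memory_active_set(data_path, select_active_set, load_columns,
                              solve_subproblem, is_converged):
    """Out-of-core active set loop (sketch; all helpers are assumed).

    Only the features in the current active set are held in memory;
    each outer iteration streams the full dataset from disk once.
    """
    model = None
    while not is_converged(model):
        # One pass over the on-disk data to score and select features.
        active = select_active_set(data_path, model)
        # Load only the active columns; they fit in memory by design.
        X_active = load_columns(data_path, active)
        # Solve the small in-memory subproblem over the active features.
        model = solve_subproblem(X_active, active, model)
    return model
```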
  • Slide 27
  • Limited Memory Setting: experiment with the 12 GB Webspam dataset and 1 GB of memory. [Plot: relative suboptimality vs. time (s); legend: AdaGrad_1.0, AdaGrad_10.0, AdaGrad_100.0, CD ("No Prioritization") and Strong Rule, Blitz ("Prioritized Memory Usage").]
  • Slide 28
  • Distributed Setting. With more than one machine, communication is costly; Blitz subproblems require communication for only the active set features. Set-up: solve with synchronous bulk gradient descent and prioritize communication using active sets (a rough sketch follows).
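A rough sketch of one such synchronous step, assuming each machine holds a shard of the data and some `allreduce` primitive that averages a vector across machines; restricting the exchange to active coordinates is what keeps the messages small. All names here are assumptions, not the paper's implementation.

```python
import numpy as np

def synchronous_step(local_gradient, allreduce, w, active, step_size):
    """One bulk synchronous step communicating only active coordinates.

    local_gradient(w) is this machine's gradient over its data shard;
    allreduce averages a vector across all machines (assumed primitive).
    """
    g = local_gradient(w)
    # Exchange only the active entries across machines (a small message).
    g_active = allreduce(g[active])
    # Apply the averaged update to the active coordinates only.
    w = w.copy()
    w[active] -= step_size * g_active
    return w
```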
  • Slide 29
  • Distributed Setting: experiment with the Criteo CTR dataset and 16 machines. [Plot: relative suboptimality vs. time (min); legend: Gradient Descent ("No Prioritization") and KKT Filter, Blitz ("Prioritized Communication").]
  • Slide 30
  • Takeaways. Active sets are effective at exploiting structure! We have introduced Blitz, an active set algorithm that provides novel, useful theoretical guarantees and is very fast in practice. Future work: extensions to a larger variety of problems, and modifications such as constraint sampling. Thanks!
  • Slide 31
  • References
    Bach, F., Jenatton, R., Mairal, J., and Obozinski, G. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.
    Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
    Fan, R. E., Chen, P. H., and Lin, C. J. Working set selection using second order information for training support vector machines. Journal of Machine Learning Research, 6:1889–1918, 2005.
    Fercoq, O. and Richtárik, P. Accelerated, parallel and proximal coordinate descent. Technical Report arXiv:1312.5799, 2013.
    Friedman, J., Hastie, T., and Tibshirani, R. Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33(1):1–22, 2010.
    Ghaoui, L. E., Viallon, V., and Rabbani, T. Safe feature elimination for the lasso and sparse supervised learning problems. Pacific Journal of Optimization, 8(4):667–698, 2012.
    Kim, H. and Park, H. Nonnegative matrix factorization based on alternating nonnegativity constrained least squares and active set method. SIAM Journal on Matrix Analysis and Applications, 30(2):713–730, 2008.
    Kim, S. J., Koh, K., Lustig, M., Boyd, S., and Gorinevsky, D. An interior-point method for large-scale L1-regularized least squares. IEEE Journal on Selected Topics in Signal Processing, 1(4):606–617, 2007.
    Koh, K., Kim, S. J., and Boyd, S. An interior-point method for large-scale L1-regularized logistic regression. Journal of Machine Learning Research, 8:1519–1555, 2007.
    Li, M., Smola, A., and Andersen, D. G. Communication efficient distributed machine learning with the parameter server. In Advances in Neural Information Processing Systems 27, 2014.
    Tibshirani, R., Bien, J., Friedman, J., Hastie, T., Simon, N., Taylor, J., and Tibshirani, R. J. Strong rules for discarding predictors in lasso-type problems. Journal of the Royal Statistical Society, Series B, 74(2):245–266, 2012.
    Tsochantaridis, I., Joachims, T., Hofmann, T., and Altun, Y. Large margin methods for structured and interdependent output variables. Journal of Machine Learning Research, 6:1453–1484, 2005.
    Xiao, L. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11:2543–2596, 2010.
    Yuan, G. X., Ho, C. H., and Lin, C. J. An improved GLMNET for L1-regularized logistic regression. Journal of Machine Learning Research, 13:1999–2030, 2012.
  • Slide 32
  • Active Set Algorithm: 1. Until x is feasible do: 2. Propose an active set of important constraints. 3. Set x to the minimizer of the objective subject to only the active set.
  • Slide 33
  • Computing the y Update. Computing the y update is a 1D optimization problem; in the worst case it can be solved with the bisection method, while for the linear case the solution is simpler (sketched below). Either way, it requires considering all constraints.
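For linear constraints a_j^T z <= b_j, the extreme feasible point on [y, x] has a closed form via a ratio test over all constraints; a minimal sketch (the array names are assumptions of this sketch, not from the slides):

```python
import numpy as np

def extreme_feasible_point(A, b, y, x):
    """Largest alpha in [0, 1] with y + alpha * (x - y) feasible for A @ z <= b.

    Assumes y is feasible. Every constraint is checked once, but each
    check is a single ratio test, so no bisection is needed.
    """
    d = x - y
    Ad = A @ d
    slack = b - A @ y                  # nonnegative slack at the feasible y
    blocking = Ad > 0                  # only these constraints can be hit
    if not blocking.any():
        return x                       # the whole segment is feasible
    alpha = min(1.0, (slack[blocking] / Ad[blocking]).min())
    return y + alpha * d
```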
  • Slide 34
  • Single Machine, Data in Memory: experiment with the high-dimensional RCV1 dataset. [Plots: support set recall and support set precision vs. time (s); legend: ProxNewt, CD, L1_LR, LIBLINEAR, GLMNET, Blitz; groups labeled "No Prioritization" and "Active Sets".]