Linear Programming for Feature Selection via Regularization

Yoonkyung Lee
Department of Statistics, The Ohio State University
(Joint work with Yonggang Yao)

July 2008
Outline
◮ Methods of regularization
◮ Solution paths
◮ Main optimization problems for feature selection
◮ Overview of linear programming
◮ Simplex algorithm for generating solution paths
◮ Implications
◮ Numerical examples
◮ Concluding remarks
Regularization
◮ Tikhonov regularization (1943): solving an ill-posed integral equation numerically
◮ Process of modifying ill-posed problems by introducing additional information about the solution
◮ Modification of the maximum likelihood principle or the empirical risk minimization principle (Bickel & Li 2006)
◮ Smoothness, sparsity, small norm, large margin, ...
◮ Bayesian connection
Methods of Regularization (Penalization)
Find f ∈ F minimizing

    (1/n) ∑_{i=1}^n L(y_i, f(x_i)) + λJ(f).

◮ Empirical risk + penalty
◮ F: a class of candidate functions
◮ J(f): the complexity of a model f
◮ λ > 0: a regularization parameter
◮ Without the penalty J(f), the problem is ill-posed
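To make the criterion concrete, here is a minimal sketch (not from the talk) that evaluates the penalized empirical risk for one particular instance, squared error loss with an ℓ₁ penalty; the function name and the loss/penalty choices are illustrative assumptions.

```python
import numpy as np

def penalized_risk(beta, X, y, lam):
    """One instance of (1/n) * sum_i L(y_i, f(x_i)) + lam * J(f):
    squared error loss with f(x_i) = x_i' beta and J(f) = ||beta||_1
    (illustrative choice; any loss/penalty pair fits the template)."""
    residuals = y - X @ beta
    empirical_risk = np.mean(residuals ** 2)
    penalty = np.sum(np.abs(beta))   # J(f) = ||beta||_1
    return empirical_risk + lam * penalty
```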
Examples of Regularization Methods
◮ Ridge regression (Hoerl and Kennard 1970)
◮ LASSO (Tibshirani 1996)
◮ Smoothing splines (Wahba 1990)
◮ Support vector machines (Vapnik 1998)
◮ Regularized neural networks, boosting, logistic regression, ...
◮ Smoothing splines:
  Find f ∈ F = W₂[0, 1] = {f : f, f′ absolutely continuous, and f″ ∈ L₂} minimizing

      (1/n) ∑_{i=1}^n (y_i − f(x_i))² + λ ∫₀¹ (f″(x))² dx,

  where J(f) = ∫₀¹ (f″(x))² dx.
◮ Support vector machines:
  Find f ∈ F = {f(x) = w⊤x + b | w ∈ R^p and b ∈ R} minimizing

      (1/n) ∑_{i=1}^n (1 − y_i f(x_i))₊ + λ‖w‖²,

  where J(f) = J(w⊤x + b) = ‖w‖².
LASSO
min_β ∑_{i=1}^n (y_i − ∑_{j=1}^p β_j x_ij)² + λ‖β‖₁
    ⇔ min_β ∑_{i=1}^n (y_i − ∑_{j=1}^p β_j x_ij)² s.t. ‖β‖₁ ≤ s
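As a hedged illustration of computing such a path in practice, the sketch below uses scikit-learn's lasso_path on simulated data; the data-generating setup is an assumption for demonstration, not the example from the talk, and sklearn's α plays the role of λ (up to a scaling of the loss).

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.standard_normal(100)

# alphas: decreasing grid of penalty values; coefs: p x n_alphas matrix
# whose columns are the LASSO solutions beta(alpha) along the path.
alphas, coefs, _ = lasso_path(X, y)
print(coefs.shape)   # (10, 100): one coefficient profile per feature
```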
Figure: LASSO coefficient paths (standardized coefficients plotted against |β|/max|β|)
Solution Paths
◮ Each regularization method defines a continuum of optimization problems indexed by a tuning parameter.
◮ λ determines the trade-off between the prediction error and the model complexity.
◮ The entire set of solutions f or β as a function of λ
◮ Complete exploration of the model space and computational savings
◮ Examples:
  ◮ LARS (Efron et al. 2004)
  ◮ SVM path (Hastie et al. 2004)
  ◮ Multicategory SVM path (Lee and Cui 2006)
  ◮ Piecewise linear paths (Rosset and Zhu 2007)
  ◮ Generalized path seeking algorithm (Friedman 2008)
Main Problem
◮ Regularization for simultaneous fitting and feature selection
◮ Convex piecewise linear loss functions
◮ Penalties of ℓ₁ nature for feature selection
  ◮ Parametric: LASSO-type
  ◮ Nonparametric: COSSO-type
    COmponent Selection and Smoothing Operator (Lin and Zhang 2003, Gunn and Kandola 2002)
◮ Non-differentiability of the loss and penalty
◮ Linear programming (LP) problems indexed by a single regularization parameter
◮ Examples:
  ◮ ℓ₁-norm SVM (Bradley and Mangasarian 1998, Zhu et al. 2004)
  ◮ ℓ₁-norm quantile regression (Li and Zhu 2005)
  ◮ θ-step (kernel selection) for structured kernel methods (Lee et al. 2006)
  ◮ Dantzig selector (Candes and Tao 2005)
  ◮ ε-insensitive loss in SVM regression
  ◮ Sup norm, max_{j=1,...,p} |β_j|
◮ Computational properties of the solutions to these problems can be treated generally by tapping into LP theory.
Linear Programming
◮ One of the cornerstones of optimization theory
◮ Applications in operations research, economics, business management, and engineering
◮ The simplex algorithm by Dantzig (1947)
◮ 'Parametric-cost LP' or 'parametric right-hand-side LP' in the optimization literature
◮ Exploit the connection to lay out general algorithms for the solution paths of the feature selection problems.
Geometry of LP

◮ Search for the minimum of a linear function over a polyhedron whose edges are defined by hyperplanes.
◮ At least one of the intersection points of the hyperplanes attains the minimum if the minimum exists.
Linear Programming
◮ Standard form of LP:

      min_{z ∈ R^N} c′z
      s.t. Az = b, z ≥ 0,

  where z is an N-vector of variables, c is a fixed N-vector, b is a fixed M-vector, and A is a fixed M × N matrix.
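A minimal sketch of solving such a standard-form LP numerically, assuming SciPy's linprog is available; the toy c, A, b below are illustrative and not from the talk.

```python
import numpy as np
from scipy.optimize import linprog

# A toy standard-form LP: min c'z  s.t.  Az = b, z >= 0.
c = np.array([1.0, 2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0]])   # M = 1 equality constraint
b = np.array([1.0])

# bounds=(0, None) encodes z >= 0 componentwise.
res = linprog(c, A_eq=A, b_eq=b, bounds=(0, None), method="highs")
print(res.x, res.fun)             # optimal z and objective value
```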
LP terminology
◮ A set B* := {B*₁, ..., B*_M} ⊂ N = {1, ..., N} is called a basic index set if A_{B*} is invertible.
◮ z* ∈ R^N is called the basic solution associated with B* if z* satisfies

      z*_{B*} := (z*_{B*₁}, ..., z*_{B*_M})′ = A_{B*}^{-1} b and z*_j = 0 for j ∈ N \ B*.

◮ A basic index set B* is called a feasible basic index set if A_{B*}^{-1} b ≥ 0.
◮ A feasible basic index set B* is also called an optimal basic index set if

      [c − A′(A_{B*}^{-1})′ c_{B*}] ≥ 0.
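These definitions translate directly into a few lines of linear algebra. The sketch below (illustrative helper names, assuming a well-conditioned A_{B*}) computes the basic solution for a given basic index set and checks the feasibility and optimality conditions numerically.

```python
import numpy as np

def basic_solution(A, b, B):
    """Basic solution z* for basic index set B: the columns of A
    indexed by B must form an invertible M x M matrix A_B."""
    M, N = A.shape
    z = np.zeros(N)
    z[B] = np.linalg.solve(A[:, B], b)   # z*_B = A_B^{-1} b
    return z

def is_optimal_basis(A, b, c, B):
    """Check feasibility (A_B^{-1} b >= 0) and the optimality
    condition c - A'(A_B^{-1})' c_B >= 0, up to rounding error."""
    z = basic_solution(A, b, B)
    reduced_cost = c - A.T @ np.linalg.solve(A[:, B].T, c[B])
    return bool(np.all(z >= -1e-9) and np.all(reduced_cost >= -1e-9))
```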
Optimality Condition for LP
Theorem. Let z* be the basic solution associated with B*, an optimal basic index set. Then z* is an optimal basic solution.

◮ The standard LP problem can be solved by finding the optimal basic index set.
Parametric Linear Programs
◮ Standard form of a parametric-cost LP:

      min_{z ∈ R^N} (c + λa)′z
      s.t. Az = b, z ≥ 0

◮ Standard form of a parametric right-hand-side LP:

      min_{z ∈ R^N} c′z
      s.t. Az = b + ωb*, z ≥ 0
Example: ℓ1-norm SVM
      min_{β₀ ∈ R, β ∈ R^p} ∑_{i=1}^n {1 − y_i(β₀ + x_iβ)}₊ + λ‖β‖₁

◮ In other words,

      min_{β₀ ∈ R, β ∈ R^p, ζ ∈ R^n} ∑_{i=1}^n (ζ_i)₊ + λ‖β‖₁
      s.t. y_i(β₀ + x_iβ) + ζ_i = 1 for i = 1, ..., n.

◮ In the standard parametric-cost form (assembled in code below):

      z := (β₀⁺, β₀⁻, (β⁺)′, (β⁻)′, (ζ⁺)′, (ζ⁻)′)′
      c := (0, 0, 0′, 0′, 1′, 0′)′
      a := (0, 0, 1′, 1′, 0′, 0′)′
      A := (Y, −Y, diag(Y)X, −diag(Y)X, I, −I), b := 1.
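A sketch of assembling these LP ingredients with NumPy, following the block structure on this slide; the helper name is hypothetical.

```python
import numpy as np

def l1_svm_lp_data(X, y):
    """Assemble (c, a, A, b) for the l1-norm SVM written as a
    parametric-cost LP: min (c + lambda * a)'z s.t. Az = b, z >= 0,
    with z = (beta0+, beta0-, beta+', beta-', zeta+', zeta-')'."""
    n, p = X.shape
    Y = y.reshape(-1, 1)              # column vector of +/-1 labels
    YX = y[:, None] * X               # diag(Y) X
    I = np.eye(n)
    A = np.hstack([Y, -Y, YX, -YX, I, -I])   # n x (2 + 2p + 2n)
    b = np.ones(n)
    c = np.concatenate([np.zeros(2 + 2 * p), np.ones(n), np.zeros(n)])
    a = np.concatenate([np.zeros(2), np.ones(2 * p), np.zeros(2 * n)])
    return c, a, A, b
```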
Optimality Interval
Corollary. For a fixed λ* ≥ 0, let B* be an optimal basic index set of the parametric-cost LP problem at λ = λ*. Define

      λ̲ := max_{j: a*_j > 0, j ∈ N \ B*} (−c*_j / a*_j)   and
      λ̄ := min_{j: a*_j < 0, j ∈ N \ B*} (−c*_j / a*_j),

where a*_j := a_j − a′_{B*} A_{B*}^{-1} A_j and c*_j := c_j − c′_{B*} A_{B*}^{-1} A_j for j ∈ N.
Then B* is an optimal basic index set for λ ∈ [λ̲, λ̄], which includes λ*.
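The corollary suggests a direct computation of the optimality interval from the reduced costs; below is a minimal sketch in the same notation (helper name hypothetical, assuming A_{B*} is nonsingular).

```python
import numpy as np

def optimality_interval(c, a, A, B):
    """Interval [lam_lo, lam_hi] of lambda values over which the basic
    index set B stays optimal for min (c + lambda*a)'z, Az = b, z >= 0,
    using the reduced costs a*_j and c*_j from the corollary."""
    N = A.shape[1]
    nonbasic = np.setdiff1d(np.arange(N), B)
    # a*_j = a_j - a_B' A_B^{-1} A_j (and similarly for c*_j)
    AB_inv_An = np.linalg.solve(A[:, B], A[:, nonbasic])
    a_star = a[nonbasic] - a[B] @ AB_inv_An
    c_star = c[nonbasic] - c[B] @ AB_inv_An
    pos, neg = a_star > 0, a_star < 0
    lam_lo = np.max(-c_star[pos] / a_star[pos]) if pos.any() else 0.0
    lam_hi = np.min(-c_star[neg] / a_star[neg]) if neg.any() else np.inf
    return lam_lo, lam_hi
```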
Simplex Algorithm
1. Initialize the optimal basic index set B⁰ at λ₋₁ = ∞.
2. Given B^l at λ = λ_{l−1}, determine the solution z^l by z^l_{B^l} = A_{B^l}^{-1} b and z^l_j = 0 for j ∈ N \ B^l.
3. Find the entry index

      j^l = argmax_{j: a^l_j > 0, j ∈ N \ B^l} (−c^l_j / a^l_j).

4. Find the exit index

      i^l = argmin_{i ∈ {j: d^l_j < 0, j ∈ B^l}} (−z^l_i / d^l_i).

5. Update the optimal basic index set to B^{l+1} = B^l ∪ {j^l} \ {i^l}.
6. Terminate the algorithm if c^l_{j^l} ≥ 0 or, equivalently, λ^l ≤ 0. Otherwise, repeat steps 2–5.
Theorem. The solution path of the parametric-cost LP is

      z⁰                        for λ > λ⁰,
      z^l                       for λ^l < λ < λ^{l−1}, l = 1, ..., J,
      τz^l + (1 − τ)z^{l+1}     for λ = λ^l and τ ∈ [0, 1], l = 0, ..., J − 1.

◮ The simplex algorithm gives a piecewise constant path (sketched in code below).
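A compact sketch of steps 1–6, assuming an initial optimal basic index set B⁰ is known and solutions are non-degenerate; this is an illustrative rendering of the path-following loop, not the tableau implementation described later or in the paper.

```python
import numpy as np

def parametric_cost_path(c, a, A, b, B0, max_steps=100):
    """Follow the parametric-cost simplex path: starting from an
    optimal basic index set B0 at lambda = infinity, pivot through
    breakpoints lambda_0 > lambda_1 > ..., yielding the piecewise
    constant path as (lambda_l, z_l) pairs. Assumes non-degeneracy."""
    N = A.shape[1]
    B = list(B0)
    path = []
    for _ in range(max_steps):
        z = np.zeros(N)
        z[B] = np.linalg.solve(A[:, B], b)         # Step 2: z^l
        nonbasic = [j for j in range(N) if j not in B]
        AB_inv_An = np.linalg.solve(A[:, B], A[:, nonbasic])
        a_red = a[nonbasic] - a[B] @ AB_inv_An     # reduced a^l_j
        c_red = c[nonbasic] - c[B] @ AB_inv_An     # reduced c^l_j
        ratios = np.full(len(nonbasic), -np.inf)
        pos = a_red > 1e-12
        ratios[pos] = -c_red[pos] / a_red[pos]
        if not pos.any():
            path.append((0.0, z))                  # optimal down to 0
            break
        k = int(np.argmax(ratios))                 # Step 3: entry j^l
        if c_red[k] >= 0:                          # Step 6: lambda_l <= 0
            path.append((0.0, z))
            break
        path.append((ratios[k], z))                # breakpoint lambda_l
        # Step 4: ratio test for the exit index i^l
        d = -np.linalg.solve(A[:, B], A[:, nonbasic[k]])
        neg = d < -1e-12
        if not neg.any():
            break                                  # unbounded pivot (not handled)
        exit_pos = np.where(neg)[0][np.argmin(-z[B][neg] / d[neg])]
        B[exit_pos] = nonbasic[k]                  # Step 5: pivot
    return path
```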
An Illustrative Example
◮ x = (x₁, ..., x₁₀) ∼ N(0, I)
◮ A probit model: Y = sign(β₀ + xβ + ε), where ε ∼ N(0, 50)
◮ β₀ = 0, β_j = 2 for j = 1, 3, 5, 10, and 0 elsewhere
◮ The Bayes error rate: 0.336
◮ n = 400
◮ ℓ₁-norm SVM (simulation sketched below)
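The simulation setup can be reproduced along these lines; a sketch only, reading N(0, 50) as variance 50 is an assumption, and the ℓ₁-norm SVM fitting step itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 400, 10
beta0 = 0.0
beta = np.zeros(p)
beta[[0, 2, 4, 9]] = 2.0                 # beta_j = 2 for j = 1, 3, 5, 10

X = rng.standard_normal((n, p))          # x ~ N(0, I)
eps = rng.normal(0.0, np.sqrt(50.0), n)  # noise with variance 50 (assumed)
y = np.sign(beta0 + X @ beta + eps)      # probit-style sign labels
```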
Figure: ℓ₁-norm SVM coefficient path indexed by λ (five-fold CV with 0-1 and hinge loss)
Alternative Formulation
◮ Example: ℓ₁-norm SVM

      min_{β₀ ∈ R, β ∈ R^p} ∑_{i=1}^n {1 − y_i(β₀ + x_iβ)}₊
      s.t. ‖β‖₁ ≤ s

◮ As a parametric right-hand-side LP:

      min_{z ∈ R^N, δ ∈ R} c′z
      s.t. Az = b, a′z + δ = s, z ≥ 0, δ ≥ 0.
Theorem. For s ≥ 0, the solution path can be expressed as

      ((s^{l+1} − s)/(s^{l+1} − s^l)) z^l + ((s − s^l)/(s^{l+1} − s^l)) z^{l+1}   if s^l ≤ s < s^{l+1}, l = 0, ..., J − 1,
      z^J                                                                          if s ≥ s^J,

where s^l = a′z^l.

◮ The simplex algorithm gives a piecewise linear path (interpolation sketched below).
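Evaluating the path between breakpoints is then simple linear interpolation; the helper below is an illustrative sketch assuming the breakpoints s⁰ ≤ ... ≤ s^J and solutions z⁰, ..., z^J have been computed and s ≥ s⁰.

```python
import numpy as np

def solution_at(s, s_breaks, z_list):
    """Evaluate the piecewise linear path at s, given increasing
    breakpoints s_breaks = [s_0, ..., s_J] (s_l = a'z_l) and the
    corresponding breakpoint solutions z_list = [z_0, ..., z_J]."""
    if s >= s_breaks[-1]:
        return z_list[-1]                      # z_J for s >= s_J
    l = np.searchsorted(s_breaks, s, side="right") - 1
    t = (s - s_breaks[l]) / (s_breaks[l + 1] - s_breaks[l])
    return (1 - t) * z_list[l] + t * z_list[l + 1]
```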
Figure: ℓ₁-norm SVM coefficient path indexed by s (five-fold CV with 0-1 and hinge loss)
Figure: The true error rate path for the ℓ₁-norm SVM under the probit model
Annual Household Income Data
◮ http://www-stat.stanford.edu/∼tibs/ElemStatLearn/
◮ Predict the annual household income with 13 demographic attributes (education, age, gender, marital status, occupation, householder status, etc.).
◮ The response takes one of nine specified income brackets.
◮ Split 6,876 records into a training set of 2,000 and a test set of 4,876.
Figure: Boxplots of the annual household income by education, age, gender, marital status, occupation, and householder status (six of the 13 demographic attributes in the data)
Median Regression with ℓ₁ Penalty

      min_β ∑_{i=1}^n |y_i − ∑_{j=1}^p β_j x_ij|   subject to ‖β‖₁ ≤ s

◮ Main effect model with 35 variables plus a quadratic term for age
◮ Partial two-way interaction model with 69 additional two-way interactions (out of 531 potential interaction terms); an LP formulation is sketched below
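The constrained median regression above is itself an LP after splitting β and the residuals into positive and negative parts; the sketch below (hypothetical helper, using SciPy's linprog for a single value of s rather than the full path) illustrates the formulation.

```python
import numpy as np
from scipy.optimize import linprog

def l1_median_regression(X, y, s):
    """l1-constrained median (LAD) regression as an LP:
    min sum(u + v)  s.t.  X(b+ - b-) + u - v = y,
    sum(b+ + b-) <= s,  and all variables >= 0."""
    n, p = X.shape
    # variable order: [b+ (p), b- (p), u (n), v (n)]
    c = np.concatenate([np.zeros(2 * p), np.ones(2 * n)])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    A_ub = np.concatenate([np.ones(2 * p), np.zeros(2 * n)])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=[s], A_eq=A_eq, b_eq=y,
                  bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:2 * p]   # beta = b+ - b-
```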
Main effect model

Figure: Coefficient paths for the main effect model. Positive: home ownership (in dark blue, relative to renting), education (in brown), dual income due to marriage (in purple, relative to 'not married'), age (in sky blue), and male (in light green). Negative: single or divorced (in red, relative to 'married') and student, clerical worker, retired, or unemployed (in green, relative to professionals/managers).
Two-way interaction model

Figure: Coefficient paths for the two-way interaction model. Positive: 'dual income ∗ home ownership', 'home ownership ∗ education', and 'married but no dual income ∗ education'. Negative: 'single ∗ education' and 'home ownership ∗ age'.
Risk Path
Figure: The risks of the two-way fitted models, estimated using a test data set with 4,876 observations
Refinement of the Simplex Algorithm
◮ The simplex algorithm assumes non-degeneracy of solutions, i.e., z^l ≠ z^{l+1} for each l.
◮ A tableau-simplex algorithm with the anti-cycling property handles more general settings.
◮ Structural commonalities in the elements of the standard LP form can be utilized for efficient computation.
Concluding Remarks
◮ Establish the connection between a family of regularization problems for feature selection and LP theory.
◮ Shed new light on solution path-finding algorithms for these optimization problems in comparison with existing algorithms.
◮ Provide fast and efficient computational tools for screening and selection of features in regression and classification problems.
◮ Unified algorithm with modular treatment of different procedures (lpRegPath)
◮ Model selection (or averaging) and validation
◮ Optimization theory and tools are very useful to statisticians.
Reference
◮ Yao, Y. and Lee, Y. (2007). Another Look at Linear Programming for Feature Selection via Methods of Regularization. Technical Report No. 800, Department of Statistics, The Ohio State University.
◮ For a copy of the paper and slides of this talk, visit http://www.stat.ohio-state.edu/∼yklee
◮ E-mail: yklee@stat.osu.edu