Linear Programming for Feature Selection via Regularization

Yoonkyung Lee
Department of Statistics, The Ohio State University
(Joint work with Yonggang Yao)

July 2008
Outline
◮ Methods of regularization
◮ Solution paths
◮ Main optimization problems for feature selection
◮ Overview of linear programming
◮ Simplex algorithm for generating solution paths
◮ Implications
◮ Numerical examples
◮ Concluding remarks
Regularization
◮ Tikhonov regularization (1943): solving an ill-posed integral equation numerically
◮ Process of modifying ill-posed problems by introducing additional information about the solution
◮ Modification of the maximum likelihood principle or the empirical risk minimization principle (Bickel & Li 2006)
◮ Smoothness, sparsity, small norm, large margin, ...
◮ Bayesian connection
Methods of Regularization (Penalization)
Find f ∈ F minimizing

    (1/n) ∑_{i=1}^n L(y_i, f(x_i)) + λJ(f).

◮ Empirical risk + penalty
◮ F: a class of candidate functions
◮ J(f): the complexity of a model f
◮ λ > 0: a regularization parameter
◮ Without the penalty J(f), the problem is ill-posed
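To make the criterion concrete, here is a minimal sketch (not from the talk) that evaluates the penalized empirical risk for one particular instance, squared error loss with an ℓ₁ penalty; the function name and the loss/penalty choices are illustrative assumptions.

```python
import numpy as np

def penalized_risk(beta, X, y, lam):
    """One instance of (1/n) * sum_i L(y_i, f(x_i)) + lam * J(f):
    squared error loss with f(x_i) = x_i' beta and J(f) = ||beta||_1
    (illustrative choice; any loss/penalty pair fits the template)."""
    residuals = y - X @ beta
    empirical_risk = np.mean(residuals ** 2)
    penalty = np.sum(np.abs(beta))   # J(f) = ||beta||_1
    return empirical_risk + lam * penalty
```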
Examples of Regularization Methods
◮ Ridge regression (Hoerl and Kennard 1970)
◮ LASSO (Tibshirani 1996)
◮ Smoothing splines (Wahba 1990)
◮ Support vector machines (Vapnik 1998)
◮ Regularized neural networks, boosting, logistic regression, ...
◮ Smoothing splines:
  Find f ∈ F = W₂[0, 1] = {f : f, f′ absolutely continuous, and f″ ∈ L₂} minimizing

      (1/n) ∑_{i=1}^n (y_i − f(x_i))² + λ ∫₀¹ (f″(x))² dx,

  where J(f) = ∫₀¹ (f″(x))² dx.
◮ Support vector machines:
  Find f ∈ F = {f(x) = w⊤x + b | w ∈ R^p and b ∈ R} minimizing

      (1/n) ∑_{i=1}^n (1 − y_i f(x_i))₊ + λ‖w‖²,

  where J(f) = J(w⊤x + b) = ‖w‖².
LASSO
min_β ∑_{i=1}^n (y_i − ∑_{j=1}^p β_j x_ij)² + λ‖β‖₁
    ⇔ min_β ∑_{i=1}^n (y_i − ∑_{j=1}^p β_j x_ij)² s.t. ‖β‖₁ ≤ s
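As a hedged illustration of computing such a path in practice, the sketch below uses scikit-learn's lasso_path on simulated data; the data-generating setup is an assumption for demonstration, not the example from the talk, and sklearn's α plays the role of λ (up to a scaling of the loss).

```python
import numpy as np
from sklearn.linear_model import lasso_path

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 10))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.standard_normal(100)

# alphas: decreasing grid of penalty values; coefs: p x n_alphas matrix
# whose columns are the LASSO solutions beta(alpha) along the path.
alphas, coefs, _ = lasso_path(X, y)
print(coefs.shape)   # (10, 100): one coefficient profile per feature
```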
Figure: LASSO coefficient paths (standardized coefficients plotted against |β|/max|β|)
Solution Paths
◮ Each regularization method defines a continuum of optimization problems indexed by a tuning parameter.
◮ λ determines the trade-off between the prediction error and the model complexity.
◮ The entire set of solutions f or β as a function of λ
◮ Complete exploration of the model space and computational savings
◮ Examples:
  ◮ LARS (Efron et al. 2004)
  ◮ SVM path (Hastie et al. 2004)
  ◮ Multicategory SVM path (Lee and Cui 2006)
  ◮ Piecewise linear paths (Rosset and Zhu 2007)
  ◮ Generalized path seeking algorithm (Friedman 2008)
Main Problem
◮ Regularization for simultaneous fitting and feature selection
◮ Convex piecewise linear loss functions
◮ Penalties of ℓ₁ nature for feature selection
  ◮ Parametric: LASSO-type
  ◮ Nonparametric: COSSO-type
    COmponent Selection and Smoothing Operator (Lin and Zhang 2003, Gunn and Kandola 2002)
◮ Non-differentiability of the loss and penalty
◮ Linear programming (LP) problems indexed by a single regularization parameter
◮ Examples:
  ◮ ℓ₁-norm SVM (Bradley and Mangasarian 1998, Zhu et al. 2004)
  ◮ ℓ₁-norm quantile regression (Li and Zhu 2005)
  ◮ θ-step (kernel selection) for structured kernel methods (Lee et al. 2006)
  ◮ Dantzig selector (Candes and Tao 2005)
  ◮ ε-insensitive loss in SVM regression
  ◮ Sup norm, max_{j=1,...,p} |β_j|
◮ Computational properties of the solutions to these problems can be treated generally by tapping into LP theory.
Linear Programming
◮ One of the cornerstones of optimization theory
◮ Applications in operations research, economics, business management, and engineering
◮ The simplex algorithm by Dantzig (1947)
◮ 'Parametric-cost LP' or 'parametric right-hand-side LP' in the optimization literature
◮ Exploit the connection to lay out general algorithms for the solution paths of the feature selection problems.
Geometry of LP

◮ Search for the minimum of a linear function over a polyhedron whose edges are defined by hyperplanes.
◮ At least one of the intersection points of the hyperplanes attains the minimum if the minimum exists.
Linear Programming
◮ Standard form of LP:

      min_{z ∈ R^N} c′z
      s.t. Az = b, z ≥ 0,

  where z is an N-vector of variables, c is a fixed N-vector, b is a fixed M-vector, and A is a fixed M × N matrix.
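A minimal sketch of solving such a standard-form LP numerically, assuming SciPy's linprog is available; the toy c, A, b below are illustrative and not from the talk.

```python
import numpy as np
from scipy.optimize import linprog

# A toy standard-form LP: min c'z  s.t.  Az = b, z >= 0.
c = np.array([1.0, 2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0]])   # M = 1 equality constraint
b = np.array([1.0])

# bounds=(0, None) encodes z >= 0 componentwise.
res = linprog(c, A_eq=A, b_eq=b, bounds=(0, None), method="highs")
print(res.x, res.fun)             # optimal z and objective value
```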
LP terminology
◮ A set B* := {B*₁, ..., B*_M} ⊂ N = {1, ..., N} is called a basic index set if A_{B*} is invertible.
◮ z* ∈ R^N is called the basic solution associated with B* if z* satisfies

      z*_{B*} := (z*_{B*₁}, ..., z*_{B*_M})′ = A_{B*}^{-1} b and z*_j = 0 for j ∈ N \ B*.

◮ A basic index set B* is called a feasible basic index set if A_{B*}^{-1} b ≥ 0.
◮ A feasible basic index set B* is also called an optimal basic index set if

      [c − A′(A_{B*}^{-1})′ c_{B*}] ≥ 0.
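These definitions translate directly into a few lines of linear algebra. The sketch below (illustrative helper names, assuming a well-conditioned A_{B*}) computes the basic solution for a given basic index set and checks the feasibility and optimality conditions numerically.

```python
import numpy as np

def basic_solution(A, b, B):
    """Basic solution z* for basic index set B: the columns of A
    indexed by B must form an invertible M x M matrix A_B."""
    M, N = A.shape
    z = np.zeros(N)
    z[B] = np.linalg.solve(A[:, B], b)   # z*_B = A_B^{-1} b
    return z

def is_optimal_basis(A, b, c, B):
    """Check feasibility (A_B^{-1} b >= 0) and the optimality
    condition c - A'(A_B^{-1})' c_B >= 0, up to rounding error."""
    z = basic_solution(A, b, B)
    reduced_cost = c - A.T @ np.linalg.solve(A[:, B].T, c[B])
    return bool(np.all(z >= -1e-9) and np.all(reduced_cost >= -1e-9))
```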
Optimality Condition for LP
Theorem. Let z* be the basic solution associated with B*, an optimal basic index set. Then z* is an optimal basic solution.

◮ The standard LP problem can be solved by finding the optimal basic index set.
Parametric Linear Programs
◮ Standard form of a parametric-cost LP:

      min_{z ∈ R^N} (c + λa)′z
      s.t. Az = b, z ≥ 0

◮ Standard form of a parametric right-hand-side LP:

      min_{z ∈ R^N} c′z
      s.t. Az = b + ωb*, z ≥ 0
Example: ℓ1-norm SVM
      min_{β₀ ∈ R, β ∈ R^p} ∑_{i=1}^n {1 − y_i(β₀ + x_iβ)}₊ + λ‖β‖₁

◮ In other words,

      min_{β₀ ∈ R, β ∈ R^p, ζ ∈ R^n} ∑_{i=1}^n (ζ_i)₊ + λ‖β‖₁
      s.t. y_i(β₀ + x_iβ) + ζ_i = 1 for i = 1, ..., n.

◮ In the standard parametric-cost form (assembled in code below):

      z := (β₀⁺, β₀⁻, (β⁺)′, (β⁻)′, (ζ⁺)′, (ζ⁻)′)′
      c := (0, 0, 0′, 0′, 1′, 0′)′
      a := (0, 0, 1′, 1′, 0′, 0′)′
      A := (Y, −Y, diag(Y)X, −diag(Y)X, I, −I), b := 1.
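A sketch of assembling these LP ingredients with NumPy, following the block structure on this slide; the helper name is hypothetical.

```python
import numpy as np

def l1_svm_lp_data(X, y):
    """Assemble (c, a, A, b) for the l1-norm SVM written as a
    parametric-cost LP: min (c + lambda * a)'z s.t. Az = b, z >= 0,
    with z = (beta0+, beta0-, beta+', beta-', zeta+', zeta-')'."""
    n, p = X.shape
    Y = y.reshape(-1, 1)              # column vector of +/-1 labels
    YX = y[:, None] * X               # diag(Y) X
    I = np.eye(n)
    A = np.hstack([Y, -Y, YX, -YX, I, -I])   # n x (2 + 2p + 2n)
    b = np.ones(n)
    c = np.concatenate([np.zeros(2 + 2 * p), np.ones(n), np.zeros(n)])
    a = np.concatenate([np.zeros(2), np.ones(2 * p), np.zeros(2 * n)])
    return c, a, A, b
```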
Optimality Interval
Corollary. For a fixed λ* ≥ 0, let B* be an optimal basic index set of the parametric-cost LP problem at λ = λ*. Define

      λ̲ := max_{j: a*_j > 0, j ∈ N \ B*} (−c*_j / a*_j)   and
      λ̄ := min_{j: a*_j < 0, j ∈ N \ B*} (−c*_j / a*_j),

where a*_j := a_j − a′_{B*} A_{B*}^{-1} A_j and c*_j := c_j − c′_{B*} A_{B*}^{-1} A_j for j ∈ N.
Then B* is an optimal basic index set for λ ∈ [λ̲, λ̄], which includes λ*.
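The corollary suggests a direct computation of the optimality interval from the reduced costs; below is a minimal sketch in the same notation (helper name hypothetical, assuming A_{B*} is nonsingular).

```python
import numpy as np

def optimality_interval(c, a, A, B):
    """Interval [lam_lo, lam_hi] of lambda values over which the basic
    index set B stays optimal for min (c + lambda*a)'z, Az = b, z >= 0,
    using the reduced costs a*_j and c*_j from the corollary."""
    N = A.shape[1]
    nonbasic = np.setdiff1d(np.arange(N), B)
    # a*_j = a_j - a_B' A_B^{-1} A_j (and similarly for c*_j)
    AB_inv_An = np.linalg.solve(A[:, B], A[:, nonbasic])
    a_star = a[nonbasic] - a[B] @ AB_inv_An
    c_star = c[nonbasic] - c[B] @ AB_inv_An
    pos, neg = a_star > 0, a_star < 0
    lam_lo = np.max(-c_star[pos] / a_star[pos]) if pos.any() else 0.0
    lam_hi = np.min(-c_star[neg] / a_star[neg]) if neg.any() else np.inf
    return lam_lo, lam_hi
```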
Simplex Algorithm
1. Initialize the optimal basic index set B⁰ at λ₋₁ = ∞.
2. Given B^l at λ = λ_{l−1}, determine the solution z^l by z^l_{B^l} = A_{B^l}^{-1} b and z^l_j = 0 for j ∈ N \ B^l.
3. Find the entry index

      j^l = argmax_{j: a^l_j > 0, j ∈ N \ B^l} (−c^l_j / a^l_j).

4. Find the exit index

      i^l = argmin_{i ∈ {j: d^l_j < 0, j ∈ B^l}} (−z^l_i / d^l_i).

5. Update the optimal basic index set to B^{l+1} = B^l ∪ {j^l} \ {i^l}.
6. Terminate the algorithm if c^l_{j^l} ≥ 0 or, equivalently, λ^l ≤ 0. Otherwise, repeat steps 2–5.
Theorem. The solution path of the parametric-cost LP is

      z⁰                        for λ > λ⁰,
      z^l                       for λ^l < λ < λ^{l−1}, l = 1, ..., J,
      τz^l + (1 − τ)z^{l+1}     for λ = λ^l and τ ∈ [0, 1], l = 0, ..., J − 1.

◮ The simplex algorithm gives a piecewise constant path (sketched in code below).
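A compact sketch of steps 1–6, assuming an initial optimal basic index set B⁰ is known and solutions are non-degenerate; this is an illustrative rendering of the path-following loop, not the tableau implementation described later or in the paper.

```python
import numpy as np

def parametric_cost_path(c, a, A, b, B0, max_steps=100):
    """Follow the parametric-cost simplex path: starting from an
    optimal basic index set B0 at lambda = infinity, pivot through
    breakpoints lambda_0 > lambda_1 > ..., yielding the piecewise
    constant path as (lambda_l, z_l) pairs. Assumes non-degeneracy."""
    N = A.shape[1]
    B = list(B0)
    path = []
    for _ in range(max_steps):
        z = np.zeros(N)
        z[B] = np.linalg.solve(A[:, B], b)         # Step 2: z^l
        nonbasic = [j for j in range(N) if j not in B]
        AB_inv_An = np.linalg.solve(A[:, B], A[:, nonbasic])
        a_red = a[nonbasic] - a[B] @ AB_inv_An     # reduced a^l_j
        c_red = c[nonbasic] - c[B] @ AB_inv_An     # reduced c^l_j
        ratios = np.full(len(nonbasic), -np.inf)
        pos = a_red > 1e-12
        ratios[pos] = -c_red[pos] / a_red[pos]
        if not pos.any():
            path.append((0.0, z))                  # optimal down to 0
            break
        k = int(np.argmax(ratios))                 # Step 3: entry j^l
        if c_red[k] >= 0:                          # Step 6: lambda_l <= 0
            path.append((0.0, z))
            break
        path.append((ratios[k], z))                # breakpoint lambda_l
        # Step 4: ratio test for the exit index i^l
        d = -np.linalg.solve(A[:, B], A[:, nonbasic[k]])
        neg = d < -1e-12
        if not neg.any():
            break                                  # unbounded pivot (not handled)
        exit_pos = np.where(neg)[0][np.argmin(-z[B][neg] / d[neg])]
        B[exit_pos] = nonbasic[k]                  # Step 5: pivot
    return path
```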
An Illustrative Example
◮ x = (x₁, ..., x₁₀) ∼ N(0, I)
◮ A probit model: Y = sign(β₀ + xβ + ε), where ε ∼ N(0, 50)
◮ β₀ = 0, β_j = 2 for j = 1, 3, 5, 10, and 0 elsewhere
◮ The Bayes error rate: 0.336
◮ n = 400
◮ ℓ₁-norm SVM (simulation sketched below)
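The simulation setup can be reproduced along these lines; a sketch only, reading N(0, 50) as variance 50 is an assumption, and the ℓ₁-norm SVM fitting step itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 400, 10
beta0 = 0.0
beta = np.zeros(p)
beta[[0, 2, 4, 9]] = 2.0                 # beta_j = 2 for j = 1, 3, 5, 10

X = rng.standard_normal((n, p))          # x ~ N(0, I)
eps = rng.normal(0.0, np.sqrt(50.0), n)  # noise with variance 50 (assumed)
y = np.sign(beta0 + X @ beta + eps)      # probit-style sign labels
```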
Figure: ℓ₁-norm SVM coefficient path indexed by λ (five-fold CV with 0-1 and hinge loss)
Alternative Formulation
◮ Example: ℓ₁-norm SVM

      min_{β₀ ∈ R, β ∈ R^p} ∑_{i=1}^n {1 − y_i(β₀ + x_iβ)}₊
      s.t. ‖β‖₁ ≤ s

◮ As a parametric right-hand-side LP:

      min_{z ∈ R^N, δ ∈ R} c′z
      s.t. Az = b, a′z + δ = s, z ≥ 0, δ ≥ 0.
Theorem. For s ≥ 0, the solution path can be expressed as

      ((s^{l+1} − s)/(s^{l+1} − s^l)) z^l + ((s − s^l)/(s^{l+1} − s^l)) z^{l+1}   if s^l ≤ s < s^{l+1}, l = 0, ..., J − 1,
      z^J                                                                          if s ≥ s^J,

where s^l = a′z^l.

◮ The simplex algorithm gives a piecewise linear path (interpolation sketched below).
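Evaluating the path between breakpoints is then simple linear interpolation; the helper below is an illustrative sketch assuming the breakpoints s⁰ ≤ ... ≤ s^J and solutions z⁰, ..., z^J have been computed and s ≥ s⁰.

```python
import numpy as np

def solution_at(s, s_breaks, z_list):
    """Evaluate the piecewise linear path at s, given increasing
    breakpoints s_breaks = [s_0, ..., s_J] (s_l = a'z_l) and the
    corresponding breakpoint solutions z_list = [z_0, ..., z_J]."""
    if s >= s_breaks[-1]:
        return z_list[-1]                      # z_J for s >= s_J
    l = np.searchsorted(s_breaks, s, side="right") - 1
    t = (s - s_breaks[l]) / (s_breaks[l + 1] - s_breaks[l])
    return (1 - t) * z_list[l] + t * z_list[l + 1]
```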
Figure: ℓ₁-norm SVM coefficient path indexed by s (five-fold CV with 0-1 and hinge loss)
Figure: The true error rate path for the ℓ₁-norm SVM under the probit model
Annual Household Income Data
◮ http://www-stat.stanford.edu/∼tibs/ElemStatLearn/
◮ Predict the annual household income with 13 demographic attributes (education, age, gender, marital status, occupation, householder status, etc.).
◮ The response takes one of nine specified income brackets.
◮ Split 6,876 records into a training set of 2,000 and a test set of 4,876.
Figure: Boxplots of the annual household income by education, age, gender, marital status, occupation, and householder status (six of the 13 demographic attributes in the data)
Median Regression with ℓ₁ Penalty

      min_β ∑_{i=1}^n |y_i − ∑_{j=1}^p β_j x_ij|   subject to ‖β‖₁ ≤ s

◮ Main effect model with 35 variables plus a quadratic term for age
◮ Partial two-way interaction model with 69 additional two-way interactions (out of 531 potential interaction terms); an LP formulation is sketched below
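The constrained median regression above is itself an LP after splitting β and the residuals into positive and negative parts; the sketch below (hypothetical helper, using SciPy's linprog for a single value of s rather than the full path) illustrates the formulation.

```python
import numpy as np
from scipy.optimize import linprog

def l1_median_regression(X, y, s):
    """l1-constrained median (LAD) regression as an LP:
    min sum(u + v)  s.t.  X(b+ - b-) + u - v = y,
    sum(b+ + b-) <= s,  and all variables >= 0."""
    n, p = X.shape
    # variable order: [b+ (p), b- (p), u (n), v (n)]
    c = np.concatenate([np.zeros(2 * p), np.ones(2 * n)])
    A_eq = np.hstack([X, -X, np.eye(n), -np.eye(n)])
    A_ub = np.concatenate([np.ones(2 * p), np.zeros(2 * n)])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=[s], A_eq=A_eq, b_eq=y,
                  bounds=(0, None), method="highs")
    return res.x[:p] - res.x[p:2 * p]   # beta = b+ - b-
```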
Main effect model

Figure: Coefficient paths for the main effect model. Positive: home ownership (in dark blue, relative to renting), education (in brown), dual income due to marriage (in purple, relative to 'not married'), age (in sky blue), and male (in light green). Negative: single or divorced (in red, relative to 'married') and student, clerical worker, retired, or unemployed (in green, relative to professionals/managers).
Two-way interaction model

Figure: Coefficient paths for the two-way interaction model. Positive: 'dual income ∗ home ownership', 'home ownership ∗ education', and 'married but no dual income ∗ education'. Negative: 'single ∗ education' and 'home ownership ∗ age'.
Risk Path
Figure: The risks of the two-way fitted models, estimated using a test data set with 4,876 observations
Refinement of the Simplex Algorithm
◮ The simplex algorithm assumes non-degeneracy of solutions, i.e., z^l ≠ z^{l+1} for each l.
◮ A tableau-simplex algorithm with the anti-cycling property handles more general settings.
◮ Structural commonalities in the elements of the standard LP form can be utilized for efficient computation.
Concluding Remarks
◮ Establish the connection between a family of regularization problems for feature selection and LP theory.
◮ Shed new light on solution path-finding algorithms for these optimization problems in comparison with existing algorithms.
◮ Provide fast and efficient computational tools for screening and selection of features in regression and classification problems.
◮ Unified algorithm with modular treatment of different procedures (lpRegPath)
◮ Model selection (or averaging) and validation
◮ Optimization theory and tools are very useful to statisticians.
Reference
◮ Yao, Y. and Lee, Y. (2007). Another Look at Linear Programming for Feature Selection via Methods of Regularization. Technical Report No. 800, Department of Statistics, The Ohio State University.
◮ For a copy of the paper and slides of this talk, visit http://www.stat.ohio-state.edu/∼yklee
◮ E-mail: yklee@stat.osu.edu