
Problems in Sparse Multivariate Statistics with a Discrete Optimization Lens

Rahul Mazumder

Massachusetts Institute of Technology

(Joint with D. Bertsimas, M. Copenhaver, A. King, P. Radchenko, H. Qin, J. Goetz, K. Khamaru)

August, 2016

R. Mazumder Sparsity with Discrete Optimization 1


Motivation

• Several basic statistical estimation tasks are inherently discrete

• Often dismissed as computationally infeasible

• We often "relax" the hard problems:
  – Convex (continuous) optimization plays a key role (e.g., the Lasso)
  – These relaxations work very well in many cases...

• However, relaxation often leads to a compromise in statistical performance

• Question: can we use advances in discrete optimization to globally solve the nonconvex problems?


Motivation

• We seldom know a priori which method will work for a given application

• "...A statistician's toolkit should have a whole array of methods, to experiment with..."
      ...Jerome H. Friedman

• Use tools from mathematical optimization (discrete & convex) to devise estimators:
  – that are flexible
  – that have a disciplined computational framework:
      obtain almost-optimal solutions in seconds/minutes
      certify optimality in minutes/hours


Outline

• Best Subset Selection in Regression [Mallows '66, Miller '90]
  – Least Squares Variable Selection
  – Discrete Dantzig Selector
  – Grouped Variable Selection and Sparse Additive Models

• Robust Linear Regression [Rousseeuw '83]
  – Least Median of Squares Regression

• Low-rank Factor Analysis [Spearman '04]
  – Least Squares Factor Analysis
  – Maximum Likelihood Factor Analysis


Best Subset Regression: Statement

[Bertsimas, King, M., ’16, Annals of Statistics]

• Usual linear regression model: n samples, p regressors

• Want a sparse β with good data fidelity:

    min_β (1/2)‖y − Xβ‖_2^2  s.t.  ‖β‖_0 ≤ k    (⋆)

  [Miller '90; Foster & George '94; George '00]

• Problem (⋆) is NP-hard [Natarajan '95].

• The R package leaps can handle n ≥ p with p ≤ 31
  (branch and leaps [Furnival & Wilson 1974])

• Not surprisingly, practitioners are advised to stay away from Problem (⋆).
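As a baseline, problem (⋆) can be solved exactly by enumerating supports; a minimal numpy sketch (the helper name `best_subset` is illustrative, and this is not the talk's MIO approach) makes plain why brute force scales only to tiny p:

```python
# Brute-force best subset selection: enumerate all supports of size <= k
# and keep the least squares fit with the smallest loss. The number of
# supports grows combinatorially in p, hence "computationally infeasible".
import itertools
import numpy as np

def best_subset(X, y, k):
    """Minimize 0.5*||y - X b||_2^2 subject to ||b||_0 <= k."""
    n, p = X.shape
    best_loss, best_b = 0.5 * float(y @ y), np.zeros(p)  # empty support
    for size in range(1, k + 1):
        for S in itertools.combinations(range(p), size):
            S = list(S)
            coef, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
            resid = y - X[:, S] @ coef
            loss = 0.5 * float(resid @ resid)
            if loss < best_loss:
                best_loss = loss
                best_b = np.zeros(p)
                best_b[S] = coef
    return best_b, best_loss
```

With k = 2 and p = 64 (the diabetes example later in the deck) this already means thousands of least squares solves, and the count explodes for larger k.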


Best Subset Regression: Current Approaches & Limitations

• The Lasso (ℓ1) [Tibshirani '96; Chen & Donoho '98] is a very popular and effective proxy:

    min_β (1/2)‖y − Xβ‖_2^2 + λ‖β‖_1

• Computation: convex optimization, fast and scalable

• ℓ1 ⟹ good models, under assumptions that are difficult to verify

• ℓ1 ⇏ reliable sparse solutions, and ℓ1 solutions ≠ ℓ0 solutions

  [Bühlmann & van de Geer '11; Cai & Shen '11; Zhang & Jiang '08, ...]
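The Lasso proxy above is typically solved by cyclic coordinate descent with soft-thresholding; a hedged numpy sketch (step sizes and convergence checks simplified, function names illustrative):

```python
# Cyclic coordinate descent for 0.5*||y - X b||^2 + lam*||b||_1:
# each coordinate update is a univariate soft-thresholding step.
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)   # per-coordinate curvature ||X_j||^2
    r = y.copy()                    # running residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]     # form the partial residual without j
            b[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
            r -= X[:, j] * b[j]
    return b
```

The soft-thresholding in the inner loop is exactly the uniform shrinkage of large and small coefficients that the next slide blames for the Lasso's bias.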


Shortcomings of the Lasso: a simple explanation

• In the presence of correlated variables, to obtain a model with good predictive power, the Lasso brings in a large number of nonzero coefficients.

• The Lasso yields biased estimates: the ℓ1 norm penalizes large and small coefficients uniformly.

• Upon increasing the degree of regularization, the Lasso sets more coefficients to zero, leaving true predictors out of the active set.


Best Subset Regression: `1 vs `0

• If β̂ denotes the best subset solution, then for any (fixed) X,

    sup_{‖β*‖_0 ≤ k} (1/n) E‖Xβ̂ − Xβ*‖_2^2  ≲  σ^2 k log p / n.

• If β̂_{ℓ1} denotes a Lasso-based k-sparse estimator, then there exists an X such that

    (1/γ^2) σ^2 k^{1−δ} log p / n  ≲  sup_{‖β*‖_0 ≤ k} (1/n) E‖Xβ̂_{ℓ1} − Xβ*‖_2^2  ≲  (1/γ^2) σ^2 k log p / n.

• There is a significant gap between ℓ0- and ℓ1-type solutions.

  [Bunea et al. '07; Raskutti et al. '09; Zhang et al. '14]


Best Subset Regression: Current Approaches & Limitations

• To circumvent these shortcomings, alternatives exist.

• Nonconvex penalties / greedy methods
  [Fan & Li '01; Zou '06; Zou & Li '08; Zhang '10; Mazumder et al. '11; Zhang & Zhang '12; Loh & Wainwright '14]

• The resulting problems are nonconvex and hard to solve.

• Computational approaches are mostly heuristic: they cannot certify/prove global optimality for an arbitrary dataset.
  Exception: [Liu, Yao & Li '16]


Best Subset Regression: Our approach

[Bertsimas, King, M., ’16, Annals of Statistics]

• Certifiably solve  min_β (1/2)‖y − Xβ‖_2^2  s.t.  ‖β‖_0 ≤ k

• Main workhorses, tools from different branches of optimization:

  – the modern technology of Mixed Integer Optimization (MIO)

  – Discrete First Order methods (motivated by convex continuous optimization)


Best Subset Regression: Our approach

• Consider  min_β (1/2)‖y − Xβ‖_2^2  s.t.  ‖β‖_0 ≤ k

• Express it as a Mixed Integer Optimization (MIO) problem

• Discrete First Order methods for advanced warm starts

• Enhancing MIO: stronger formulations


Brief Background on MIO


Mixed Integer Optimization (MIO)

• MIO: a particular class of discrete optimization problems

• The general form of a Mixed Integer Quadratic Optimization problem:

    min   α'Qα + α'a
    s.t.  Aα ≤ b
          α_i ∈ {0, 1},  ∀ i ∈ I
          α_j ∈ ℝ_+,     ∀ j ∉ I,

  where a ∈ ℝ^m, A ∈ ℝ^{k×m}, b ∈ ℝ^k, and Q ∈ ℝ^{m×m} (PSD) are problem parameters.

• Special instances: Mixed Integer Linear Optimization, Quadratic/Linear Programming, ...


Mixed Integer Optimization (MIO)

• MIO methods employ a combination of branch and bound, branch and cut, cutting plane methods, ... (not complete enumeration)

• Foundations deeply rooted in polyhedral theory, combinatorics, discrete geometry/algebra, ...

• Worst case: NP-hard. Our focus is not worst-case analysis.
  (Compare the simplex algorithm, path algorithms like LARS, TSP, ...)

• Modern MIO is tractable in practice.
  Tractability: the ability to solve problems of realistic size in times appropriate for the applications we consider.
  (Successful applications: production planning, transportation, inventory management, air-traffic control, warehouse location, matching assignments, ...)


Progress of MIO

• Algorithms and software have undergone huge improvements over the past 25+ years (1991-2016).

• Algorithmic speed-up: ~1.4 million times
  (combined speedup, CPLEX 1.2 to 11 and Gurobi 1.0 to 6.5)

• Hardware speed-up: ~1.6 million times
  (peak supercomputer performance)

• Total speed-up: ~2.2 trillion times!

• Commercial packages: Xpress, Gurobi, CPLEX, ...
  Non-commercial packages: GLPK, lp_solve, CBC, SCIP, ...
  Interfaces: MATLAB, R, Python, Julia (JuMP)


Back to Formulation


Vanilla MIO formulation

For the problem  min_β (1/2)‖y − Xβ‖_2^2  s.t.  ‖β‖_0 ≤ k,
a simple (natural) MIO formulation is given by

    min_{β,z}  (1/2)‖y − Xβ‖_2^2
    s.t.       |β_i| ≤ M·z_i,   i = 1, ..., p
               Σ_{i=1}^p z_i ≤ k
               z_i ∈ {0, 1},    i = 1, ..., p,

where M (the "big-M") is a parameter:
  – M ≥ ‖β̂‖_∞ for an optimal solution β̂
  – M controls the strength of the MIO formulation


Diabetes Dataset, n = 350, p = 64, k = 6

Typical behavior of the overall algorithm (k = 6):

[Figure: objective upper and lower bounds converging to the global minimum, and the MIO gap shrinking toward zero, as functions of time (secs) over roughly 400 seconds.]


Our approach

• Consider  min_β (1/2)‖y − Xβ‖_2^2  s.t.  ‖β‖_0 ≤ k

• Express best-subset as a Mixed Integer Optimization (MIO) problem

• Discrete First Order methods for advanced warm starts

• Enhancing MIO: stronger formulations


Discrete First Order Method

• A stylized gradient-based method for

    min_β g(β)  s.t.  ‖β‖_0 ≤ k,

  with g(β) convex and ‖∇g(β) − ∇g(β₀)‖ ≤ ℓ·‖β − β₀‖.

• This implies that for all L ≥ ℓ,

    g(β) ≤ Q(β) := g(β₀) + ⟨∇g(β₀), β − β₀⟩ + (L/2)‖β − β₀‖_2^2.

• For the purpose of finding feasible solutions, we propose solving

    min_β Q(β)  s.t.  ‖β‖_0 ≤ k.

  [Related work: Blumensath & Davies '08; Donoho & Johnstone '95]


Solution

• Equivalent to

    min_β (L/2)‖β − (β₀ − (1/L)∇g(β₀))‖_2^2  s.t.  ‖β‖_0 ≤ k,

• reducing to

    min_β ‖β − u‖_2^2  s.t.  ‖β‖_0 ≤ k,   u := β₀ − (1/L)∇g(β₀).

• An optimal solution is β* ∈ H_k(u), where H_k(u) is the hard-thresholding operator (it retains the top k entries of u in absolute value and sets the rest to zero).
  [Donoho & Johnstone '95]
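A minimal pure-Python sketch of the operator H_k just described (ties in |u_i| make H_k set-valued, hence "an optimal solution" rather than "the"; this sketch breaks ties arbitrarily):

```python
# Hard thresholding H_k(u): keep the k entries of u that are largest
# in absolute value, set every other entry to zero.
def hard_threshold(u, k):
    """Return a closest k-sparse vector to u in Euclidean distance."""
    order = sorted(range(len(u)), key=lambda i: abs(u[i]), reverse=True)
    keep = set(order[:k])
    return [u[i] if i in keep else 0.0 for i in range(len(u))]
```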


Discrete First Order Algorithm (DFA)

Algorithm to obtain feasible solutions for

    min_β g(β)  s.t.  ‖β‖_0 ≤ k.

1. Initialize with a solution β⁰; set m = 0.

2. m := m + 1.

3. β^{m+1} ∈ H_k(β^m − (1/L)∇g(β^m)).

4. Perform a line search to refine β^{m+1}.

5. Repeat Steps 2-4 until ‖β^{m+1} − β^m‖ ≤ ε.
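A hedged numpy sketch of the scheme above, specialized to g(β) = (1/2)‖y − Xβ‖_2^2 (this is iterative hard thresholding; the line search of Step 4 is omitted for brevity, and function names are illustrative):

```python
# Discrete first order algorithm for sparse least squares:
# repeatedly take a gradient step and hard-threshold to k nonzeros.
import numpy as np

def hard_threshold(u, k):
    out = np.zeros_like(u)
    keep = np.argsort(np.abs(u))[-k:]   # indices of the k largest |u_i|
    out[keep] = u[keep]
    return out

def dfa_least_squares(X, y, k, max_iter=500, eps=1e-10):
    L = np.linalg.norm(X, 2) ** 2       # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = X.T @ (X @ b - y)
        b_next = hard_threshold(b - grad / L, k)
        if np.linalg.norm(b_next - b) <= eps:
            return b_next
        b = b_next
    return b
```

In the talk's pipeline, the stationary points this produces are the advanced warm starts handed to the MIO solver.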


Convergence properties

Theorem (Bertsimas, King, M. '16).

Let β^m, m ≥ 1, be generated by the DFA:

(a) For any L ≥ ℓ, the sequence g(β^m) is decreasing and converges.

(b) If L > ℓ, then under some minor regularity properties:

  – ‖β^{m+1} − β^m‖_2^2 ≤ ε in at most O(1/ε) iterations;

  – Supp(β^m) stabilizes after finitely many iterations, and β^m converges to a first order stationary point.


Our approach

• Consider  min_β (1/2)‖y − Xβ‖_2^2  s.t.  ‖β‖_0 ≤ k

• Express best-subset as a Mixed Integer Optimization (MIO) problem

• Discrete First Order methods for advanced warm starts

• Enhancing MIO: stronger formulations


Special Ordered Sets (SOS) formulation

    min_β ‖y − Xβ‖_2^2  s.t.  ‖β‖_0 ≤ k

is equivalent to

    min_{β,z}  ‖y − Xβ‖_2^2
    s.t.       (β_i, 1 − z_i) : SOS type-1,  i = 1, ..., p
               Σ_{i=1}^p z_i ≤ k
               z_i ∈ {0, 1},  i = 1, ..., p.


Implied Constraints

    min_β ‖y − Xβ‖_2^2  s.t.  ‖β‖_0 ≤ k

is equivalent to

    min_β  ‖y − Xβ‖_2^2
    s.t.   ‖β‖_0 ≤ k
           ‖β‖_∞ ≤ δ_{11},   ‖β‖_1 ≤ δ_{21}
           ‖Xβ‖_∞ ≤ δ_{12},  ‖Xβ‖_1 ≤ δ_{22}

for constants δ_{11}, δ_{12}, δ_{21}, δ_{22} (which can be computed from the data).


Behavior with user-guided intelligence

Diabetes data: n = 350, p = 64.

[Figure: log(MIO gap) vs. time (secs), warm start vs. cold start, for k = 5 (within about 80 secs) and k = 31 (up to about 3,500 secs); warm starts close the gap markedly faster.]


Statistical Behavior


Sparsity Detection for n = 500, p = 100

[Figure: number of nonzeros selected by MIO, Lasso, Step, and Sparsenet at signal-to-noise ratios 1.742, 3.484, and 6.967.]


Prediction Error = ‖Xβ_alg − Xβ_true‖_2^2 / ‖Xβ_true‖_2^2

[Figure: prediction performance of MIO, Lasso, Step, and Sparsenet at signal-to-noise ratios 1.742, 3.484, and 6.967.]


Sparsity Detection for n = 50, p = 2000

[Figure: number of nonzeros selected by Lasso, First Order + MIO, First Order Only, and Sparsenet at signal-to-noise ratios 3, 7, and 10.]


Prediction Error for n = 50, p = 2000

[Figure: prediction performance of Lasso, First Order + MIO, First Order Only, and Sparsenet at signal-to-noise ratios 3, 7, and 10.]


What did we learn?

• For the case n > p, MIO + intelligence finds provably optimal solutions for n in the 500s and p in the 100s in minutes.

• For the case n < p, MIO + intelligence finds solutions for n in the 50s and p in the 1000s in minutes, proving (approximate) optimality in hours.

• MIO solutions have a significant edge in sparsity and improved prediction accuracy.

• Modern optimization (MIO + user-guided intelligence) is capable of tackling large instances.


Outline

• Best Subset Selection in Regression [Mallows '66, Miller '90]
  – Least Squares Variable Selection
  – Discrete Dantzig Selector
  – Grouped Variable Selection and Sparse Additive Models

• Robust Linear Regression [Rousseeuw '83]
  – Least Median of Squares Regression

• Low-rank Factor Analysis [Spearman '04]
  – Least Squares Factor Analysis
  – Maximum Likelihood Factor Analysis


The Discrete Dantzig Selector

[M. & Radchenko '16+]

• The Dantzig Selector [Candès & Tao '07]:

    β̂_{DS,ℓ1} ∈ argmin ‖β‖_1  s.t.  ‖X′(y − Xβ)‖_∞ ≤ δ

• Instead, consider its ℓ0 analogue:

    β̂_{DS,ℓ0} ∈ argmin ‖β‖_0  s.t.  ‖X′(y − Xβ)‖_∞ ≤ δ

• Find the sparsest β such that the maximal (absolute) correlation between covariates and residuals is small.

• Why is this important?
  – The formulation is a Mixed Integer Linear Optimization problem.
  – Mixed Integer Linear Optimization is a more mature technology than Mixed Integer Quadratic Optimization.
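Both selectors share the same feasible region; a small numpy sketch of that constraint as a feasibility check (the helper name `dantzig_feasible` is illustrative):

```python
# The Dantzig selector constraint ||X'(y - X b)||_inf <= delta:
# every covariate's absolute correlation with the residual is small.
import numpy as np

def dantzig_feasible(X, y, b, delta):
    """Is the max absolute covariate-residual correlation at most delta?"""
    return float(np.max(np.abs(X.T @ (y - X @ b)))) <= delta
```

The full least squares fit is feasible for any δ ≥ 0 (its residual is orthogonal to every column of X, by the normal equations), so the ℓ0 problem asks how few nonzeros suffice to get near that orthogonality.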


The Discrete Dantzig Selector

Under a sparse linear model with Gaussian errors, y = Xβ* + ε:

• The errors
  – ‖β̂_{DS,ℓ0} − β*‖_2^2
  – ‖β̂_{DS,ℓ0} − β*‖_1^2
  – ‖X(β̂_{DS,ℓ0} − β*)‖_2^2
  are much smaller than those of the convex estimator β̂_{DS,ℓ1} (when features are correlated).

• The number of nonzeros of β̂_{DS,ℓ0} is much smaller than that of β̂_{DS,ℓ1}.

• The statistical properties of β̂_{DS,ℓ0} are comparable with Least Squares Subset Selection.



Some Large Problems

Synthetic examples:

    n       p       k*   Upper Bound   Lower Bound   MIO Gap   Time to Prove Opt
    4,000   8,000   20   20            20            0         41.9
    3,000   8,000   20   20            20            0         18.3
    1,000   10,000  10   10            10            0         14.2
    5,000   10,000  10   10            10            0         2.5
    10,000  10,000  30   30            27            10%       42.5

Real data examples:

    n       p       k*   Upper Bound   Lower Bound   MIO Gap   Time to Prove Opt
    6,000   4,500   20   20            20            0         5.0
    6,000   4,500   40   40            37            10%       12.5

Table: solutions obtained within 5-10 minutes for all problems. Certifying optimality takes longer (several hours).


Outline

• Best Subset Selection in Regression [Mallows '66, Miller '90]
  – Least Squares Variable Selection
  – Discrete Dantzig Selector
  – Grouped Variable Selection and Sparse Additive Models

• Robust Linear Regression [Rousseeuw '83]
  – Least Median of Squares Regression

• Low-rank Factor Analysis [Spearman '04]
  – Least Squares Factor Analysis
  – Maximum Likelihood Factor Analysis


Effect of Outliers in Regression

[Bertsimas, M., ’14, Annals of Statistics]

• The Least Squares (LS) estimator

    β̂(LS) ∈ argmin_β Σ_{i=1}^n r_i^2,   r_i = y_i − x_i′β,

  has a breakdown point of zero (Donoho & Huber '83; Hampel '75).

• The Least Absolute Deviation (LAD) estimator also has a breakdown point of zero:

    β̂(LAD) ∈ argmin_β Σ_{i=1}^n |r_i|.

• M-estimators (Huber '73), which minimize Σ_{i=1}^n ρ(r_i) for a symmetric function ρ, only slightly improve the breakdown point.
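A toy illustration of the breakdown-point contrast, in the 1-D location model (x_i = 1) where the LS fit reduces to the sample mean, with the median standing in for the robust fits introduced next (the numbers are made up):

```python
# One contaminated observation moves the mean (LS fit) arbitrarily far,
# while the median barely moves: breakdown point 0 vs. almost 50%.
from statistics import mean, median

clean = [1.0, 1.1, 0.9, 1.05, 0.95]
dirty = clean[:-1] + [1000.0]   # replace one observation by an outlier

ls_clean, ls_dirty = mean(clean), mean(dirty)
med_clean, med_dirty = median(clean), median(dirty)
```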


Least Median Regression

• The Least Median of Squares (LMS) estimator [Rousseeuw '84]:

    β̂(LMS) ∈ argmin_β ( median_{i=1,...,n} |r_i| ).

• LMS has the highest possible breakdown point of almost 50%.

• More generally, the Least Quantile of Squares (LQS) estimator:

    β̂(LQS) ∈ argmin_β |r_(q)|,

  where r_(q) is the q-th ordered absolute residual:

    |r_(1)| ≤ |r_(2)| ≤ ... ≤ |r_(n)|.
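For a fixed candidate β, evaluating the LQS objective is just a sort of the absolute residuals; a pure-Python sketch (the helper name `lqs_objective` is illustrative):

```python
# The LQS objective |r_(q)|: the q-th smallest absolute residual
# (1-indexed, matching the slide's notation). q = median gives LMS.
def lqs_objective(X, y, b, q):
    residuals = [yi - sum(xj * bj for xj, bj in zip(xi, b))
                 for xi, yi in zip(X, y)]
    return sorted(abs(r) for r in residuals)[q - 1]
```

Note how a gross outlier only affects the objective if q reaches its rank, which is exactly the source of the robustness.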


Problem we address

• Solve the following problem:

    min_β |r_(q)|,

  where r_i = y_i − x_i′β and q is a quantile.

• Our approach extends to

    min_β |r_(q)|  s.t.  Aβ ≤ b  (and/or ‖β‖_2^2 ≤ δ).


LQS and Subset Selection: A surprising link

• LQS and subset selection in regression seem to be completely unrelated concepts...

• However, a curious link emerges...

• Claim: LQS is performing an implicit subset search.

Theorem [Bertsimas & M. '14]. The LQS problem is equivalent to the following:

    min_β |r_(q)| = min_{I ∈ Ω_q} ( min_β ‖y_I − X_I β‖_∞ ),

where Ω_q := {I : I ⊂ {1, ..., n}, |I| = q} and (y_I, X_I) denotes the subsample {(y_i, x_i) : i ∈ I}.
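The subset-search structure can be checked numerically for a fixed β: the q-th ordered absolute residual coincides with the minimum over size-q subsets of the worst in-subset residual (a brute-force sketch of that identity, not of the full theorem, which also optimizes over β):

```python
# For a fixed residual vector r: sorting and subset search agree,
#   |r_(q)| = min over |I| = q of max_{i in I} |r_i|.
import itertools

def qth_ordered(r, q):
    return sorted(abs(v) for v in r)[q - 1]

def subset_form(r, q):
    return min(max(abs(r[i]) for i in I)
               for I in itertools.combinations(range(len(r)), q))
```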


Overview of our approach

I Write the LMS problem as a MIO.
  — Main idea: the MIO formulation expresses |r_(q)| via sorting
  — Formulation very different from best subset selection in regression

I Using Discrete First Order methods we find good feasible solutions.

I Warm-starts and improved behavior with user-guided intelligence


MIO Formulation

Notation: |r_(1)| ≤ |r_(2)| ≤ . . . ≤ |r_(n)|.

Step 1: Introduce binary variables z_i, i = 1, . . . , n such that:

    z_i = 1 if |r_i| ≤ |r_(q)|, and z_i = 0 otherwise.

Step 2: Use auxiliary continuous variables µ̄_i ≥ 0, µ_i ≥ 0 such that:

    |r_i| − µ_i ≤ |r_(q)| ≤ |r_i| + µ̄_i, i = 1, . . . , n,

with the conditions:

    if |r_i| ≥ |r_(q)|, then µ̄_i = 0 (and µ_i ≥ 0);
    if |r_i| ≤ |r_(q)|, then µ_i = 0 (and µ̄_i ≥ 0).

These conditions are MIO representable.


MIO Formulation

min  γ

s.t. |r_i| + µ̄_i ≥ γ,       i = 1, . . . , n
     γ ≥ |r_i| − µ_i,       i = 1, . . . , n
     M_u z_i ≥ µ̄_i,         i = 1, . . . , n
     M_ℓ (1 − z_i) ≥ µ_i,   i = 1, . . . , n
     ∑_{i=1}^n z_i = q
     µ̄_i ≥ 0, µ_i ≥ 0,      i = 1, . . . , n
     z_i ∈ {0, 1},          i = 1, . . . , n,

where γ, z_i, µ̄_i, µ_i, i = 1, . . . , n are decision variables and M_u, M_ℓ are Big-M constants.
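A quick feasibility check of this construction with made-up residual magnitudes (a sketch; `mu_bar` and `mu` denote the two auxiliary variable families): fixing the |r_i| and setting γ = |r_(q)|, the Step 1–2 recipe yields a point satisfying every constraint:

```python
import numpy as np

# Feasibility check for the Big-M formulation with fixed residual magnitudes.
abs_r = np.array([3.0, 0.5, 2.0, 4.5, 1.0])
n, q = len(abs_r), 3
gamma = np.sort(abs_r)[q - 1]                  # gamma = |r_(q)| = 2.0

M_u = M_l = abs_r.max() + 1.0                  # valid Big-M constants
z = (abs_r <= gamma).astype(float)             # Step 1: z_i = 1 iff |r_i| <= |r_(q)|
mu_bar = np.where(z == 1, gamma - abs_r, 0.0)  # lifts small residuals up to gamma
mu = np.where(z == 0, abs_r - gamma, 0.0)      # pulls large residuals down to gamma

# All MIO constraints hold at this point:
assert np.all(abs_r + mu_bar >= gamma - 1e-9)
assert np.all(gamma >= abs_r - mu - 1e-9)
assert np.all(M_u * z >= mu_bar) and np.all(M_l * (1 - z) >= mu)
assert z.sum() == q
```

The optimization, of course, runs the other way: the solver searches over β (hence over the residuals) and the big-M constraints force γ to track |r_(q)|.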


What do we achieve?

I Prior exact algorithms can solve up to: n = 50 and p = 5

I We obtain:

I near optimal solutions for problems with n in the 200s and p in the 20s in seconds, proving optimality in minutes.

I near optimal solutions for problems with n ≈ 10,000 and p ≈ 50 in minutes.


Outline

I Best Subset Selection in Regression [Mallows ’66, Miller ’90]

— Least Squares Variable Selection
— Discrete Dantzig Selector
— Grouped Variable Selection and Sparse Additive Models

I Robust Linear Regression [Rousseeuw ’83]

— Least Median of Squares Regression

I Low rank Factor Analysis [Spearman ’04]

— Least Squares Factor Analysis
— Maximum Likelihood Factor Analysis


Background & Formulation

[Bertsimas, Copenhaver, M., ’16+]

Low Rank Factor Analysis (FA) [Spearman 1904]:

I widely used in multivariate statistics, econometrics, psychometrics

I represent correlation structure with few common (latent) factors.

Estimation Problem: Σ = L1L1′ + L2L2′, where Θ := L1L1′ (low rank: few common factors) and L2L2′ is small.

− Σ ≈ Θ + Φ

− Φ = diag(Φ1, . . . , Φp) ⪰ 0

− rank(Θ) ≤ r, Θ ⪰ 0

− Σ − Θ ⪰ 0; Σ − Φ ⪰ 0

This leads to:

    min  ‖Σ − (Θ + Φ)‖
    s.t. rank(Θ) ≤ r
         Σ − Φ ⪰ 0
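For intuition, a numpy sketch on synthetic data (our own example, not from the talk): when the diagonal Φ is fixed and Σ − Φ ⪰ 0, the best rank-r PSD choice of Θ keeps the top-r eigenpairs of Σ − Φ, and the remaining error is the sum of the p − r smallest eigenvalues:

```python
import numpy as np

# Synthetic Sigma with an exact r-factor structure: Sigma = L L' + Phi.
rng = np.random.default_rng(0)
p, r = 6, 2
L = rng.normal(size=(p, r))
Phi = np.diag(rng.uniform(0.5, 1.0, size=p))
Sigma = L @ L.T + Phi

# With Phi fixed and Sigma - Phi PSD, the best rank-r PSD Theta keeps the
# top-r eigenpairs of Sigma - Phi (np.linalg.eigh returns ascending order).
w, V = np.linalg.eigh(Sigma - Phi)
Theta = (V[:, -r:] * w[-r:]) @ V[:, -r:].T

# Remaining error = sum of the p - r smallest eigenvalues; the structure
# here is exact, so it vanishes up to numerical noise.
err = np.abs(w[:p - r]).sum()
```

The hard part of the estimation problem is the joint search over Φ and Θ; this inner eigenvalue step is what the reformulation below exploits.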



Our Approach

min  ‖Σ − (Θ + Φ)‖     ← Sum of Singular Values
s.t. rank(Θ) ≤ r       ← Rank Constraint
     Σ − Φ ⪰ 0         ← Semidefinite Constraint
                                                  (†)

I SDP with rank constraints

I Key Idea: Reformulate (†) equivalently as a SDP (without rank constraint)

I Nonlinear Optimization techniques for feasible solutions

I Specialized Branch & Bound methods to certify optimality


Reformulation and tailored B&B

min  ‖Σ − (Θ + Φ)‖
s.t. rank(Θ) ≤ r
     Σ − Φ ⪰ 0

        ⇕  Variational Representation of Spectral Functions

min  ⟨W, Σ⟩ − ∑_{i=1}^p w_ii Φ_i     ← Bilinear Form (Nonconvex): McCormick Hulls / B&B
s.t. I ⪰ W ⪰ 0
     Tr(W) = p − r
     Σ − Φ ⪰ 0
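The variational representation can be checked numerically (a sketch with a random PSD matrix standing in for Σ − Φ): the minimum of ⟨W, A⟩ over {I ⪰ W ⪰ 0, Tr(W) = p − r} equals the sum of the p − r smallest eigenvalues of A, attained at the projector onto the bottom eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 5, 2
A = rng.normal(size=(p, p))
A = A @ A.T                          # random PSD matrix, standing in for Sigma - Phi

w, V = np.linalg.eigh(A)             # eigenvalues in ascending order
target = w[:p - r].sum()             # sum of the p - r smallest eigenvalues

# Minimizer: the projector onto the bottom p - r eigenvectors.
W = V[:, :p - r] @ V[:, :p - r].T
assert np.isclose(np.trace(W @ A), target)
assert np.isclose(np.trace(W), p - r)

# Any other feasible point does no better, e.g. a random rank-(p - r) projector:
Q, _ = np.linalg.qr(rng.normal(size=(p, p - r)))
assert np.trace(Q @ Q.T @ A) >= target - 1e-9
```

With A = Σ − Φ and Φ diagonal, ⟨W, Σ − Φ⟩ expands to ⟨W, Σ⟩ − ∑_i w_ii Φ_i, which is the bilinear term the McCormick relaxation targets.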


What do we learn?

I Several experiments on both real and synthetic datasets reveal:

I Upper bounds obtained within a few seconds (p = 100) to several minutes (p = 4000)

I Certifying optimality takes longer (several hours)

I Global optimality certificates obtained on datasets where the assumptions required for convex approaches to succeed cannot be verified.

ArXiv link: http://arxiv.org/pdf/1604.06837v1.pdf


Conclusions

I MIO is an advanced, computationally tractable mathematicalprogramming framework

I Provides a powerful modeling tool for statistical problems

I Leads to a significant edge in Sparse Learning problems that areinherently discrete.

I 15.097: PhD class taught at MIT Spring 2016 on related topics.


Thank you!

All papers available at: http://www.mit.edu/~rahulmaz/research.html
