
Problems in Sparse Multivariate Statistics with a Discrete Optimization Lens

Rahul Mazumder

Massachusetts Institute of Technology

(Joint with D. Bertsimas, M. Copenhaver, A. King, P. Radchenko, H. Qin, J. Goetz, K. Khamaru)

August, 2016

R. Mazumder Sparsity with Discrete Optimization 1


Motivation

• Several basic statistical estimation tasks are inherently discrete

• Often dismissed as computationally infeasible

• We often "relax" the hard problems:
  – Convex (continuous) optimization plays a key role (e.g., the Lasso)
  – These relaxations work very well in many cases...

• However, relaxation often leads to a compromise in statistical performance

• Question: can we use advances in discrete optimization to globally solve the nonconvex problems?


Motivation

• We seldom know a priori which method will work for a given application

• "...A statistician's toolkit should have a whole array of methods, to experiment with..."
      ...Jerome H. Friedman

• Use tools from mathematical optimization (discrete & convex) to devise estimators:
  – that are flexible
  – that have a disciplined computational framework:
      obtain almost-optimal solutions in seconds/minutes
      certify optimality in minutes/hours


Outline

• Best Subset Selection in Regression [Mallows '66, Miller '90]
  – Least Squares Variable Selection
  – Discrete Dantzig Selector
  – Grouped Variable Selection and Sparse Additive Models

• Robust Linear Regression [Rousseeuw '83]
  – Least Median of Squares Regression

• Low-rank Factor Analysis [Spearman '04]
  – Least Squares Factor Analysis
  – Maximum Likelihood Factor Analysis


Best Subset Regression: Statement

[Bertsimas, King, M., ’16, Annals of Statistics]

• Usual linear regression model: n samples, p regressors

• Want a sparse β with good data fidelity:

    min_β (1/2)‖y − Xβ‖_2^2  s.t.  ‖β‖_0 ≤ k    (⋆)

  [Miller '90; Foster & George '94; George '00]

• Problem (⋆) is NP-hard [Natarajan '95].

• The R package leaps can handle n ≥ p with p ≤ 31
  (branch and leaps [Furnival & Wilson 1974])

• Not surprisingly, practitioners are advised to stay away from Problem (⋆).
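As a baseline, problem (⋆) can be solved exactly by enumerating supports; a minimal numpy sketch (the helper name `best_subset` is illustrative, and this is not the talk's MIO approach) makes plain why brute force scales only to tiny p:

```python
# Brute-force best subset selection: enumerate all supports of size <= k
# and keep the least squares fit with the smallest loss. The number of
# supports grows combinatorially in p, hence "computationally infeasible".
import itertools
import numpy as np

def best_subset(X, y, k):
    """Minimize 0.5*||y - X b||_2^2 subject to ||b||_0 <= k."""
    n, p = X.shape
    best_loss, best_b = 0.5 * float(y @ y), np.zeros(p)  # empty support
    for size in range(1, k + 1):
        for S in itertools.combinations(range(p), size):
            S = list(S)
            coef, *_ = np.linalg.lstsq(X[:, S], y, rcond=None)
            resid = y - X[:, S] @ coef
            loss = 0.5 * float(resid @ resid)
            if loss < best_loss:
                best_loss = loss
                best_b = np.zeros(p)
                best_b[S] = coef
    return best_b, best_loss
```

With k = 2 and p = 64 (the diabetes example later in the deck) this already means thousands of least squares solves, and the count explodes for larger k.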


Best Subset Regression: Current Approaches & Limitations

• The Lasso (ℓ1) [Tibshirani '96; Chen & Donoho '98] is a very popular and effective proxy:

    min_β (1/2)‖y − Xβ‖_2^2 + λ‖β‖_1

• Computation: convex optimization, fast and scalable

• ℓ1 ⟹ good models, under assumptions that are difficult to verify

• ℓ1 ⇏ reliable sparse solutions, and ℓ1 solutions ≠ ℓ0 solutions

  [Bühlmann & van de Geer '11; Cai & Shen '11; Zhang & Jiang '08, ...]
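The Lasso proxy above is typically solved by cyclic coordinate descent with soft-thresholding; a hedged numpy sketch (step sizes and convergence checks simplified, function names illustrative):

```python
# Cyclic coordinate descent for 0.5*||y - X b||^2 + lam*||b||_1:
# each coordinate update is a univariate soft-thresholding step.
import numpy as np

def soft_threshold(x, t):
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def lasso_cd(X, y, lam, n_iter=200):
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0)   # per-coordinate curvature ||X_j||^2
    r = y.copy()                    # running residual y - X b
    for _ in range(n_iter):
        for j in range(p):
            r += X[:, j] * b[j]     # form the partial residual without j
            b[j] = soft_threshold(X[:, j] @ r, lam) / col_sq[j]
            r -= X[:, j] * b[j]
    return b
```

The soft-thresholding in the inner loop is exactly the uniform shrinkage of large and small coefficients that the next slide blames for the Lasso's bias.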


Shortcomings of the Lasso: a simple explanation

• In the presence of correlated variables, to obtain a model with good predictive power, the Lasso brings in a large number of nonzero coefficients.

• The Lasso yields biased estimates: the ℓ1 norm penalizes large and small coefficients uniformly.

• Upon increasing the degree of regularization, the Lasso sets more coefficients to zero, leaving true predictors out of the active set.


Best Subset Regression: `1 vs `0

• If β̂ denotes the best subset solution, then for any (fixed) X,

    sup_{‖β*‖_0 ≤ k} (1/n) E‖Xβ̂ − Xβ*‖_2^2  ≲  σ^2 k log p / n.

• If β̂_{ℓ1} denotes a Lasso-based k-sparse estimator, then there exists an X such that

    (1/γ^2) σ^2 k^{1−δ} log p / n  ≲  sup_{‖β*‖_0 ≤ k} (1/n) E‖Xβ̂_{ℓ1} − Xβ*‖_2^2  ≲  (1/γ^2) σ^2 k log p / n.

• There is a significant gap between ℓ0- and ℓ1-type solutions.

  [Bunea et al. '07; Raskutti et al. '09; Zhang et al. '14]


Best Subset Regression: Current Approaches & Limitations

• To circumvent these shortcomings, alternatives exist.

• Nonconvex penalties / greedy methods
  [Fan & Li '01; Zou '06; Zou & Li '08; Zhang '10; Mazumder et al. '11; Zhang & Zhang '12; Loh & Wainwright '14]

• The resulting problems are nonconvex and hard to solve.

• Computational approaches are mostly heuristic: they cannot certify/prove global optimality for an arbitrary dataset.
  Exception: [Liu, Yao & Li '16]


Best Subset Regression: Our approach

[Bertsimas, King, M., ’16, Annals of Statistics]

• Certifiably solve  min_β (1/2)‖y − Xβ‖_2^2  s.t.  ‖β‖_0 ≤ k

• Main workhorses, tools from different branches of optimization:

  – the modern technology of Mixed Integer Optimization (MIO)

  – Discrete First Order methods (motivated by convex continuous optimization)


Best Subset Regression: Our approach

• Consider  min_β (1/2)‖y − Xβ‖_2^2  s.t.  ‖β‖_0 ≤ k

• Express it as a Mixed Integer Optimization (MIO) problem

• Discrete First Order methods for advanced warm starts

• Enhancing MIO: stronger formulations


Brief Background on MIO


Mixed Integer Optimization (MIO)

• MIO: a particular class of discrete optimization problems

• The general form of a Mixed Integer Quadratic Optimization problem:

    min   α'Qα + α'a
    s.t.  Aα ≤ b
          α_i ∈ {0, 1},  ∀ i ∈ I
          α_j ∈ ℝ_+,     ∀ j ∉ I,

  where a ∈ ℝ^m, A ∈ ℝ^{k×m}, b ∈ ℝ^k, and Q ∈ ℝ^{m×m} (PSD) are problem parameters.

• Special instances: Mixed Integer Linear Optimization, Quadratic/Linear Programming, ...


Mixed Integer Optimization (MIO)

• MIO methods employ a combination of branch and bound, branch and cut, cutting plane methods, ... (not complete enumeration)

• Foundations deeply rooted in polyhedral theory, combinatorics, discrete geometry/algebra, ...

• Worst case: NP-hard. Our focus is not worst-case analysis.
  (Compare the simplex algorithm, path algorithms like LARS, TSP, ...)

• Modern MIO is tractable in practice.
  Tractability: the ability to solve problems of realistic size in times appropriate for the applications we consider.
  (Successful applications: production planning, transportation, inventory management, air-traffic control, warehouse location, matching assignments, ...)


Progress of MIO

• Algorithms and software have undergone huge improvements over the past 25+ years (1991-2016).

• Algorithmic speed-up: ~1.4 million times
  (combined speedup, CPLEX 1.2 to 11 and Gurobi 1.0 to 6.5)

• Hardware speed-up: ~1.6 million times
  (peak supercomputer performance)

• Total speed-up: ~2.2 trillion times!

• Commercial packages: Xpress, Gurobi, CPLEX, ...
  Non-commercial packages: GLPK, lp_solve, CBC, SCIP, ...
  Interfaces: MATLAB, R, Python, Julia (JuMP)


Back to Formulation


Vanilla MIO formulation

For the problem  min_β (1/2)‖y − Xβ‖_2^2  s.t.  ‖β‖_0 ≤ k,
a simple (natural) MIO formulation is given by

    min_{β,z}  (1/2)‖y − Xβ‖_2^2
    s.t.       |β_i| ≤ M·z_i,   i = 1, ..., p
               Σ_{i=1}^p z_i ≤ k
               z_i ∈ {0, 1},    i = 1, ..., p,

where M (the "big-M") is a parameter:
  – M ≥ ‖β̂‖_∞ for an optimal solution β̂
  – M controls the strength of the MIO formulation


Diabetes Dataset, n = 350, p = 64, k = 6

Typical behavior of the overall algorithm (k = 6):

[Figure: objective upper and lower bounds converging to the global minimum, and the MIO gap shrinking toward zero, as functions of time (secs) over roughly 400 seconds.]


Our approach

• Consider  min_β (1/2)‖y − Xβ‖_2^2  s.t.  ‖β‖_0 ≤ k

• Express best-subset as a Mixed Integer Optimization (MIO) problem

• Discrete First Order methods for advanced warm starts

• Enhancing MIO: stronger formulations


Discrete First Order Method

• A stylized gradient-based method for

    min_β g(β)  s.t.  ‖β‖_0 ≤ k,

  with g(β) convex and ‖∇g(β) − ∇g(β₀)‖ ≤ ℓ·‖β − β₀‖.

• This implies that for all L ≥ ℓ,

    g(β) ≤ Q(β) := g(β₀) + ⟨∇g(β₀), β − β₀⟩ + (L/2)‖β − β₀‖_2^2.

• For the purpose of finding feasible solutions, we propose solving

    min_β Q(β)  s.t.  ‖β‖_0 ≤ k.

  [Related work: Blumensath & Davies '08; Donoho & Johnstone '95]


Solution

• Equivalent to

    min_β (L/2)‖β − (β₀ − (1/L)∇g(β₀))‖_2^2  s.t.  ‖β‖_0 ≤ k,

• reducing to

    min_β ‖β − u‖_2^2  s.t.  ‖β‖_0 ≤ k,   u := β₀ − (1/L)∇g(β₀).

• An optimal solution is β* ∈ H_k(u), where H_k(u) is the hard-thresholding operator (it retains the top k entries of u in absolute value and sets the rest to zero).
  [Donoho & Johnstone '95]
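A minimal pure-Python sketch of the operator H_k just described (ties in |u_i| make H_k set-valued, hence "an optimal solution" rather than "the"; this sketch breaks ties arbitrarily):

```python
# Hard thresholding H_k(u): keep the k entries of u that are largest
# in absolute value, set every other entry to zero.
def hard_threshold(u, k):
    """Return a closest k-sparse vector to u in Euclidean distance."""
    order = sorted(range(len(u)), key=lambda i: abs(u[i]), reverse=True)
    keep = set(order[:k])
    return [u[i] if i in keep else 0.0 for i in range(len(u))]
```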


Discrete First Order Algorithm (DFA)

Algorithm to obtain feasible solutions for

    min_β g(β)  s.t.  ‖β‖_0 ≤ k.

1. Initialize with a solution β⁰; set m = 0.

2. m := m + 1.

3. β^{m+1} ∈ H_k(β^m − (1/L)∇g(β^m)).

4. Perform a line search to refine β^{m+1}.

5. Repeat Steps 2-4 until ‖β^{m+1} − β^m‖ ≤ ε.
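A hedged numpy sketch of the scheme above, specialized to g(β) = (1/2)‖y − Xβ‖_2^2 (this is iterative hard thresholding; the line search of Step 4 is omitted for brevity, and function names are illustrative):

```python
# Discrete first order algorithm for sparse least squares:
# repeatedly take a gradient step and hard-threshold to k nonzeros.
import numpy as np

def hard_threshold(u, k):
    out = np.zeros_like(u)
    keep = np.argsort(np.abs(u))[-k:]   # indices of the k largest |u_i|
    out[keep] = u[keep]
    return out

def dfa_least_squares(X, y, k, max_iter=500, eps=1e-10):
    L = np.linalg.norm(X, 2) ** 2       # Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(max_iter):
        grad = X.T @ (X @ b - y)
        b_next = hard_threshold(b - grad / L, k)
        if np.linalg.norm(b_next - b) <= eps:
            return b_next
        b = b_next
    return b
```

In the talk's pipeline, the stationary points this produces are the advanced warm starts handed to the MIO solver.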


Convergence properties

Theorem (Bertsimas, King, M. '16).

Let β^m, m ≥ 1, be generated by the DFA:

(a) For any L ≥ ℓ, the sequence g(β^m) is decreasing and converges.

(b) If L > ℓ, then under some minor regularity properties:

  – ‖β^{m+1} − β^m‖_2^2 ≤ ε in at most O(1/ε) iterations;

  – Supp(β^m) stabilizes after finitely many iterations, and β^m converges to a first order stationary point.


Our approach

• Consider  min_β (1/2)‖y − Xβ‖_2^2  s.t.  ‖β‖_0 ≤ k

• Express best-subset as a Mixed Integer Optimization (MIO) problem

• Discrete First Order methods for advanced warm starts

• Enhancing MIO: stronger formulations


Special Ordered Sets (SOS) formulation

    min_β ‖y − Xβ‖_2^2  s.t.  ‖β‖_0 ≤ k

is equivalent to

    min_{β,z}  ‖y − Xβ‖_2^2
    s.t.       (β_i, 1 − z_i) : SOS type-1,  i = 1, ..., p
               Σ_{i=1}^p z_i ≤ k
               z_i ∈ {0, 1},  i = 1, ..., p.


Implied Constraints

    min_β ‖y − Xβ‖_2^2  s.t.  ‖β‖_0 ≤ k

is equivalent to

    min_β  ‖y − Xβ‖_2^2
    s.t.   ‖β‖_0 ≤ k
           ‖β‖_∞ ≤ δ_{11},   ‖β‖_1 ≤ δ_{21}
           ‖Xβ‖_∞ ≤ δ_{12},  ‖Xβ‖_1 ≤ δ_{22}

for constants δ_{11}, δ_{12}, δ_{21}, δ_{22} (which can be computed from the data).


Behavior with user-guided intelligence

Diabetes data: n = 350, p = 64.

[Figure: log(MIO gap) vs. time (secs), warm start vs. cold start, for k = 5 (within about 80 secs) and k = 31 (up to about 3,500 secs); warm starts close the gap markedly faster.]


Statistical Behavior


Sparsity Detection for n = 500, p = 100

[Figure: number of nonzeros selected by MIO, Lasso, Step, and Sparsenet at signal-to-noise ratios 1.742, 3.484, and 6.967.]


Prediction Error = ‖Xβ_alg − Xβ_true‖_2^2 / ‖Xβ_true‖_2^2

[Figure: prediction performance of MIO, Lasso, Step, and Sparsenet at signal-to-noise ratios 1.742, 3.484, and 6.967.]


Sparsity Detection for n = 50, p = 2000

[Figure: number of nonzeros selected by Lasso, First Order + MIO, First Order Only, and Sparsenet at signal-to-noise ratios 3, 7, and 10.]


Prediction Error for n = 50, p = 2000

[Figure: prediction performance of Lasso, First Order + MIO, First Order Only, and Sparsenet at signal-to-noise ratios 3, 7, and 10.]


What did we learn?

• For the case n > p, MIO + intelligence finds provably optimal solutions for n in the 500s and p in the 100s in minutes.

• For the case n < p, MIO + intelligence finds solutions for n in the 50s and p in the 1000s in minutes, proving (approximate) optimality in hours.

• MIO solutions have a significant edge in sparsity and improved prediction accuracy.

• Modern optimization (MIO + user-guided intelligence) is capable of tackling large instances.


Outline

• Best Subset Selection in Regression [Mallows '66, Miller '90]
  – Least Squares Variable Selection
  – Discrete Dantzig Selector
  – Grouped Variable Selection and Sparse Additive Models

• Robust Linear Regression [Rousseeuw '83]
  – Least Median of Squares Regression

• Low-rank Factor Analysis [Spearman '04]
  – Least Squares Factor Analysis
  – Maximum Likelihood Factor Analysis


The Discrete Dantzig Selector

[M. & Radchenko '16+]

• The Dantzig Selector [Candès & Tao '07]:

    β̂_{DS,ℓ1} ∈ argmin ‖β‖_1  s.t.  ‖X′(y − Xβ)‖_∞ ≤ δ

• Instead, consider its ℓ0 analogue:

    β̂_{DS,ℓ0} ∈ argmin ‖β‖_0  s.t.  ‖X′(y − Xβ)‖_∞ ≤ δ

• Find the sparsest β such that the maximal (absolute) correlation between covariates and residuals is small.

• Why is this important?
  – The formulation is a Mixed Integer Linear Optimization problem.
  – Mixed Integer Linear Optimization is a more mature technology than Mixed Integer Quadratic Optimization.
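Both selectors share the same feasible region; a small numpy sketch of that constraint as a feasibility check (the helper name `dantzig_feasible` is illustrative):

```python
# The Dantzig selector constraint ||X'(y - X b)||_inf <= delta:
# every covariate's absolute correlation with the residual is small.
import numpy as np

def dantzig_feasible(X, y, b, delta):
    """Is the max absolute covariate-residual correlation at most delta?"""
    return float(np.max(np.abs(X.T @ (y - X @ b)))) <= delta
```

The full least squares fit is feasible for any δ ≥ 0 (its residual is orthogonal to every column of X, by the normal equations), so the ℓ0 problem asks how few nonzeros suffice to get near that orthogonality.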


The Discrete Dantzig Selector

Under a sparse linear model with Gaussian errors, y = Xβ* + ε:

• The errors
  – ‖β̂_{DS,ℓ0} − β*‖_2^2
  – ‖β̂_{DS,ℓ0} − β*‖_1^2
  – ‖X(β̂_{DS,ℓ0} − β*)‖_2^2
  are much smaller than those of the convex estimator β̂_{DS,ℓ1} (when features are correlated).

• The number of nonzeros of β̂_{DS,ℓ0} is much smaller than that of β̂_{DS,ℓ1}.

• The statistical properties of β̂_{DS,ℓ0} are comparable with Least Squares Subset Selection.



Some Large Problems

Synthetic examples:

    n       p       k*   Upper Bound   Lower Bound   MIO Gap   Time to Prove Opt
    4,000   8,000   20   20            20            0         41.9
    3,000   8,000   20   20            20            0         18.3
    1,000   10,000  10   10            10            0         14.2
    5,000   10,000  10   10            10            0         2.5
    10,000  10,000  30   30            27            10%       42.5

Real data examples:

    n       p       k*   Upper Bound   Lower Bound   MIO Gap   Time to Prove Opt
    6,000   4,500   20   20            20            0         5.0
    6,000   4,500   40   40            37            10%       12.5

Table: solutions obtained within 5-10 minutes for all problems. Certifying optimality takes longer (several hours).


Outline

• Best Subset Selection in Regression [Mallows '66, Miller '90]
  – Least Squares Variable Selection
  – Discrete Dantzig Selector
  – Grouped Variable Selection and Sparse Additive Models

• Robust Linear Regression [Rousseeuw '83]
  – Least Median of Squares Regression

• Low-rank Factor Analysis [Spearman '04]
  – Least Squares Factor Analysis
  – Maximum Likelihood Factor Analysis


Effect of Outliers in Regression

[Bertsimas, M., ’14, Annals of Statistics]

• The Least Squares (LS) estimator

    β̂(LS) ∈ argmin_β Σ_{i=1}^n r_i^2,   r_i = y_i − x_i′β,

  has a breakdown point of zero (Donoho & Huber '83; Hampel '75).

• The Least Absolute Deviation (LAD) estimator also has a breakdown point of zero:

    β̂(LAD) ∈ argmin_β Σ_{i=1}^n |r_i|.

• M-estimators (Huber '73), which minimize Σ_{i=1}^n ρ(r_i) for a symmetric function ρ, only slightly improve the breakdown point.
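A toy illustration of the breakdown-point contrast, in the 1-D location model (x_i = 1) where the LS fit reduces to the sample mean, with the median standing in for the robust fits introduced next (the numbers are made up):

```python
# One contaminated observation moves the mean (LS fit) arbitrarily far,
# while the median barely moves: breakdown point 0 vs. almost 50%.
from statistics import mean, median

clean = [1.0, 1.1, 0.9, 1.05, 0.95]
dirty = clean[:-1] + [1000.0]   # replace one observation by an outlier

ls_clean, ls_dirty = mean(clean), mean(dirty)
med_clean, med_dirty = median(clean), median(dirty)
```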


Least Median Regression

• The Least Median of Squares (LMS) estimator [Rousseeuw '84]:

    β̂(LMS) ∈ argmin_β ( median_{i=1,...,n} |r_i| ).

• LMS has the highest possible breakdown point of almost 50%.

• More generally, the Least Quantile of Squares (LQS) estimator:

    β̂(LQS) ∈ argmin_β |r_(q)|,

  where r_(q) is the q-th ordered absolute residual:

    |r_(1)| ≤ |r_(2)| ≤ ... ≤ |r_(n)|.
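For a fixed candidate β, evaluating the LQS objective is just a sort of the absolute residuals; a pure-Python sketch (the helper name `lqs_objective` is illustrative):

```python
# The LQS objective |r_(q)|: the q-th smallest absolute residual
# (1-indexed, matching the slide's notation). q = median gives LMS.
def lqs_objective(X, y, b, q):
    residuals = [yi - sum(xj * bj for xj, bj in zip(xi, b))
                 for xi, yi in zip(X, y)]
    return sorted(abs(r) for r in residuals)[q - 1]
```

Note how a gross outlier only affects the objective if q reaches its rank, which is exactly the source of the robustness.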


Problem we address

• Solve the following problem:

    min_β |r_(q)|,

  where r_i = y_i − x_i′β and q is a quantile.

• Our approach extends to

    min_β |r_(q)|  s.t.  Aβ ≤ b  (and/or ‖β‖_2^2 ≤ δ).


LQS and Subset Selection: A surprising link

• LQS and subset selection in regression seem to be completely unrelated concepts...

• However, a curious link emerges...

• Claim: LQS is performing an implicit subset search.

Theorem [Bertsimas & M. '14]. The LQS problem is equivalent to the following:

    min_β |r_(q)| = min_{I ∈ Ω_q} ( min_β ‖y_I − X_I β‖_∞ ),

where Ω_q := {I : I ⊂ {1, ..., n}, |I| = q} and (y_I, X_I) denotes the subsample {(y_i, x_i) : i ∈ I}.
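The subset-search structure can be checked numerically for a fixed β: the q-th ordered absolute residual coincides with the minimum over size-q subsets of the worst in-subset residual (a brute-force sketch of that identity, not of the full theorem, which also optimizes over β):

```python
# For a fixed residual vector r: sorting and subset search agree,
#   |r_(q)| = min over |I| = q of max_{i in I} |r_i|.
import itertools

def qth_ordered(r, q):
    return sorted(abs(v) for v in r)[q - 1]

def subset_form(r, q):
    return min(max(abs(r[i]) for i in I)
               for I in itertools.combinations(range(len(r)), q))
```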


Overview of our approach

I Write the LMS problem as a MIO.
  — Main idea: the MIO formulation expresses |r_(q)| via sorting
  — Formulation very different from best subset selection in regression

I Using Discrete First Order methods we find good feasible solutions.

I Warm-starts and improved behavior with user-guided intelligence


MIO Formulation

Notation: |r_(1)| ≤ |r_(2)| ≤ . . . ≤ |r_(n)|.

Step 1: Introduce binary variables z_i, i = 1, . . . , n such that:

    z_i = 1 if |r_i| ≤ |r_(q)|, and z_i = 0 otherwise.

Step 2: Use auxiliary continuous variables µ̄_i ≥ 0, µ_i ≥ 0 such that:

    |r_i| − µ_i ≤ |r_(q)| ≤ |r_i| + µ̄_i, i = 1, . . . , n,

with the conditions:

    if |r_i| ≥ |r_(q)|, then µ̄_i = 0 (and µ_i ≥ 0);
    if |r_i| ≤ |r_(q)|, then µ_i = 0 (and µ̄_i ≥ 0).

These conditions are MIO representable.


MIO Formulation

min  γ

s.t. |r_i| + µ̄_i ≥ γ,       i = 1, . . . , n
     γ ≥ |r_i| − µ_i,       i = 1, . . . , n
     M_u z_i ≥ µ̄_i,         i = 1, . . . , n
     M_ℓ (1 − z_i) ≥ µ_i,   i = 1, . . . , n
     ∑_{i=1}^n z_i = q
     µ̄_i ≥ 0, µ_i ≥ 0,      i = 1, . . . , n
     z_i ∈ {0, 1},          i = 1, . . . , n,

where γ, z_i, µ̄_i, µ_i, i = 1, . . . , n are decision variables and M_u, M_ℓ are Big-M constants.
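A quick feasibility check of this construction with made-up residual magnitudes (a sketch; `mu_bar` and `mu` denote the two auxiliary variable families): fixing the |r_i| and setting γ = |r_(q)|, the Step 1–2 recipe yields a point satisfying every constraint:

```python
import numpy as np

# Feasibility check for the Big-M formulation with fixed residual magnitudes.
abs_r = np.array([3.0, 0.5, 2.0, 4.5, 1.0])
n, q = len(abs_r), 3
gamma = np.sort(abs_r)[q - 1]                  # gamma = |r_(q)| = 2.0

M_u = M_l = abs_r.max() + 1.0                  # valid Big-M constants
z = (abs_r <= gamma).astype(float)             # Step 1: z_i = 1 iff |r_i| <= |r_(q)|
mu_bar = np.where(z == 1, gamma - abs_r, 0.0)  # lifts small residuals up to gamma
mu = np.where(z == 0, abs_r - gamma, 0.0)      # pulls large residuals down to gamma

# All MIO constraints hold at this point:
assert np.all(abs_r + mu_bar >= gamma - 1e-9)
assert np.all(gamma >= abs_r - mu - 1e-9)
assert np.all(M_u * z >= mu_bar) and np.all(M_l * (1 - z) >= mu)
assert z.sum() == q
```

The optimization, of course, runs the other way: the solver searches over β (hence over the residuals) and the big-M constraints force γ to track |r_(q)|.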


What do we achieve?

I Prior exact algorithms can solve up to: n = 50 and p = 5

I We obtain:

I near optimal solutions for problems with n in the 200s and p in the 20s in seconds, proving optimality in minutes.

I near optimal solutions for problems with n ≈ 10,000 and p ≈ 50 in minutes.


Outline

I Best Subset Selection in Regression [Mallows ’66, Miller ’90]

— Least Squares Variable Selection
— Discrete Dantzig Selector
— Grouped Variable Selection and Sparse Additive Models

I Robust Linear Regression [Rousseeuw ’83]

— Least Median of Squares Regression

I Low rank Factor Analysis [Spearman ’04]

— Least Squares Factor Analysis
— Maximum Likelihood Factor Analysis


Background & Formulation

[Bertsimas, Copenhaver, M., ’16+]

Low Rank Factor Analysis (FA) [Spearman 1904]:

I widely used in multivariate statistics, econometrics, psychometrics

I represent correlation structure with few common (latent) factors.

Estimation Problem: Σ = L1L1′ + L2L2′, where Θ := L1L1′ (low rank: few common factors) and L2L2′ is small.

− Σ ≈ Θ + Φ

− Φ = diag(Φ1, . . . , Φp) ⪰ 0

− rank(Θ) ≤ r, Θ ⪰ 0

− Σ − Θ ⪰ 0; Σ − Φ ⪰ 0

This leads to:

    min  ‖Σ − (Θ + Φ)‖
    s.t. rank(Θ) ≤ r
         Σ − Φ ⪰ 0
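For intuition, a numpy sketch on synthetic data (our own example, not from the talk): when the diagonal Φ is fixed and Σ − Φ ⪰ 0, the best rank-r PSD choice of Θ keeps the top-r eigenpairs of Σ − Φ, and the remaining error is the sum of the p − r smallest eigenvalues:

```python
import numpy as np

# Synthetic Sigma with an exact r-factor structure: Sigma = L L' + Phi.
rng = np.random.default_rng(0)
p, r = 6, 2
L = rng.normal(size=(p, r))
Phi = np.diag(rng.uniform(0.5, 1.0, size=p))
Sigma = L @ L.T + Phi

# With Phi fixed and Sigma - Phi PSD, the best rank-r PSD Theta keeps the
# top-r eigenpairs of Sigma - Phi (np.linalg.eigh returns ascending order).
w, V = np.linalg.eigh(Sigma - Phi)
Theta = (V[:, -r:] * w[-r:]) @ V[:, -r:].T

# Remaining error = sum of the p - r smallest eigenvalues; the structure
# here is exact, so it vanishes up to numerical noise.
err = np.abs(w[:p - r]).sum()
```

The hard part of the estimation problem is the joint search over Φ and Θ; this inner eigenvalue step is what the reformulation below exploits.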



Our Approach

min  ‖Σ − (Θ + Φ)‖     ← Sum of Singular Values
s.t. rank(Θ) ≤ r       ← Rank Constraint
     Σ − Φ ⪰ 0         ← Semidefinite Constraint
                                                  (†)

I SDP with rank constraints

I Key Idea: Reformulate (†) equivalently as a SDP (without rank constraint)

I Nonlinear Optimization techniques for feasible solutions

I Specialized Branch & Bound methods to certify optimality


Reformulation and tailored B&B

min  ‖Σ − (Θ + Φ)‖
s.t. rank(Θ) ≤ r
     Σ − Φ ⪰ 0

        ⇕  Variational Representation of Spectral Functions

min  ⟨W, Σ⟩ − ∑_{i=1}^p w_ii Φ_i     ← Bilinear Form (Nonconvex): McCormick Hulls / B&B
s.t. I ⪰ W ⪰ 0
     Tr(W) = p − r
     Σ − Φ ⪰ 0
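The variational representation can be checked numerically (a sketch with a random PSD matrix standing in for Σ − Φ): the minimum of ⟨W, A⟩ over {I ⪰ W ⪰ 0, Tr(W) = p − r} equals the sum of the p − r smallest eigenvalues of A, attained at the projector onto the bottom eigenvectors:

```python
import numpy as np

rng = np.random.default_rng(1)
p, r = 5, 2
A = rng.normal(size=(p, p))
A = A @ A.T                          # random PSD matrix, standing in for Sigma - Phi

w, V = np.linalg.eigh(A)             # eigenvalues in ascending order
target = w[:p - r].sum()             # sum of the p - r smallest eigenvalues

# Minimizer: the projector onto the bottom p - r eigenvectors.
W = V[:, :p - r] @ V[:, :p - r].T
assert np.isclose(np.trace(W @ A), target)
assert np.isclose(np.trace(W), p - r)

# Any other feasible point does no better, e.g. a random rank-(p - r) projector:
Q, _ = np.linalg.qr(rng.normal(size=(p, p - r)))
assert np.trace(Q @ Q.T @ A) >= target - 1e-9
```

With A = Σ − Φ and Φ diagonal, ⟨W, Σ − Φ⟩ expands to ⟨W, Σ⟩ − ∑_i w_ii Φ_i, which is the bilinear term the McCormick relaxation targets.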


What do we learn?

I Several experiments on both real and synthetic datasets reveal:

I Upper bounds obtained within a few seconds (p = 100) to several minutes (p = 4000)

I Certifying optimality takes longer (several hours)

I Global optimality certificates obtained on datasets where the assumptions required for convex approaches to succeed cannot be verified.

ArXiv link: http://arxiv.org/pdf/1604.06837v1.pdf


Conclusions

I MIO is an advanced, computationally tractable mathematicalprogramming framework

I Provides a powerful modeling tool for statistical problems

I Leads to a significant edge in Sparse Learning problems that areinherently discrete.

I 15.097: PhD class taught at MIT Spring 2016 on related topics.


Thank you!

All papers available at: http://www.mit.edu/~rahulmaz/research.html
