
Page 1

Robust estimation via (non)convex M-estimation

Po-Ling Loh

University of Wisconsin-Madison, Departments of ECE & Statistics

Workshop on computational efficiency and high-dimensional robust statistics

TTI Chicago

August 15, 2018

Po-Ling Loh (UW-Madison) (Non)convex M-estimation Aug 15, 2018 1 / 39

Page 2

Outline

1. Regularized M-estimators
   - Statistical M-estimation
   - Nonconvexity
   - Consistency of local optima

2. High-dimensional robust regression
   - Statistical consistency
   - Asymptotic normality
   - Two-step M-estimators



Page 5

Regularized M-estimators

Prediction/regression problem: Observe {(xᵢ, yᵢ)}, i = 1, …, n, and estimate

β∗ = arg min_{β ∈ ℝᵖ} E[ℓ(β; xᵢ, yᵢ)],  xᵢ ∈ ℝᵖ, yᵢ ∈ ℝ

Statistical M-estimator:

β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n ℓ(β; xᵢ, yᵢ) }

In high dimensions, this may be ill-conditioned, with a large solution space.

Page 6

Regularized M-estimators

Regularized M-estimator:

β̂ ∈ arg min_β { Lₙ(β) + ρ_λ(β) },  where Lₙ(β) := (1/n) ∑_{i=1}^n ℓ(β; xᵢ, yᵢ)

Page 7

Example: ℓ₁-regularized OLS regression

Linear model: yᵢ = xᵢᵀβ∗ + εᵢ,  ‖β∗‖₀ ≤ k

Low-dimensional M-estimator:

β̂_OLS ∈ arg min_β { (1/n) ∑_{i=1}^n (yᵢ − xᵢᵀβ)² } = ( (1/n) ∑_{i=1}^n xᵢxᵢᵀ )⁻¹ ( (1/n) ∑_{i=1}^n yᵢxᵢ )

High-dimensional regularized M-estimator:

β̂_Lasso ∈ arg min_β { (1/n) ∑_{i=1}^n (yᵢ − xᵢᵀβ)² + λ‖β‖₁ }
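Both estimators can be sketched numerically. The following is a minimal illustration (my own construction, not code from the talk): the OLS closed form above, and the Lasso solved by proximal gradient descent (ISTA) on the equivalent objective (1/(2n))‖y − Xβ‖₂² + λ‖β‖₁; all variable names and constants are my own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 100, 20, 3
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:k] = [2.0, -1.5, 1.0]          # k-sparse ground truth
y = X @ beta_star + 0.1 * rng.standard_normal(n)

# Low-dimensional OLS: closed form ((1/n) X'X)^{-1} ((1/n) X'y)
beta_ols = np.linalg.solve(X.T @ X / n, X.T @ y / n)

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=2000):
    """Proximal gradient descent on (1/(2n)) ||y - X b||^2 + lam * ||b||_1."""
    n, p = X.shape
    L = np.linalg.eigvalsh(X.T @ X / n).max()   # Lipschitz constant of the gradient
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / n         # gradient of the smooth part
        beta = soft_threshold(beta - grad / L, lam / L)
    return beta

beta_lasso = lasso_ista(X, y, lam=0.1)
```

When p > n the closed-form OLS solution no longer exists (XᵀX is singular), while the regularized version still applies, which is the point of the high-dimensional estimator.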


Page 11

Sources of nonconvexity

May arise in loss or regularizer:

- Nonconvex loss used to correct bias, increase efficiency
- Nonconvex regularizer used to reduce bias, achieve oracle result


Page 14

Example: Errors-in-variables regression

Model: yᵢ = xᵢᵀβ∗ + εᵢ; observe {(zᵢ, yᵢ)}, i = 1, …, n, and infer β∗

OLS estimator with zᵢ plugged in for xᵢ:

β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n (zᵢᵀβ − yᵢ)² + λ‖β‖₁ }

This estimator is statistically inconsistent.


Page 16

Corrected losses

L. & Wainwright '12 propose a natural method for correcting the loss for linear regression:

β̂_OLS ∈ arg min_β { (1/2) βᵀ (XᵀX/n) β − (yᵀX/n) β + ρ_λ(β) }

β̂_corr ∈ arg min_β { (1/2) βᵀ Γ̂ β − γ̂ᵀβ + ρ_λ(β) }

where (Γ̂, γ̂) are estimators for (Cov(xᵢ), Cov(xᵢ, yᵢ)) based on {(zᵢ, yᵢ)}, i = 1, …, n.

Page 17

Example: Additive noise

Additive noise: Z = X + W; use

Γ̂ = ZᵀZ/n − Σ_w,  γ̂ = Zᵀy/n

However, the corrected objective is nonconvex:

β̂_corr ∈ arg min_β { (1/2) βᵀ (ZᵀZ/n − Σ_w) β − (yᵀZ/n) β + ρ_λ(β) }

Fortunately, local optima have good properties.
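A minimal numerical sketch (my own construction, not the paper's code) of Γ̂ and γ̂ for additive noise, illustrating why the corrected objective is nonconvex once p > n: ZᵀZ/n has rank at most n, so subtracting Σ_w forces negative eigenvalues.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 100                       # high-dimensional regime: p > n
sigma_w = 0.5
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:5] = 1.0
y = X @ beta_star + 0.1 * rng.standard_normal(n)
Z = X + sigma_w * rng.standard_normal((n, p))   # corrupted covariates

Sigma_w = sigma_w**2 * np.eye(p)
Gamma_hat = Z.T @ Z / n - Sigma_w    # unbiased estimate of Cov(x_i)
gamma_hat = Z.T @ y / n              # unbiased estimate of Cov(x_i, y_i)

# Z'Z/n has rank <= n < p, so at least p - n eigenvalues of Gamma_hat
# equal -sigma_w^2: the corrected quadratic objective is nonconvex.
min_eig = np.linalg.eigvalsh(Gamma_hat).min()
```

The negative curvature is confined to directions the ℓ₁ constraint rules out, which is what the restricted strong convexity condition later in the deck formalizes.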


Page 19

Nonconvex regularizers

ℓ₁ is a "convexified" version of ℓ₀.

[Figure: the ℓ₀ penalty (0 at t = 0, jumping to 1 elsewhere) next to the ℓ₁ penalty |t|]

But ℓ₁ penalizes larger coefficients more, causing solution bias.


Page 21

Alternative regularizers

Various nonconvex regularizers appear in the literature (Fan & Li '01, Zhang '10, etc.)

[Figure: SCAD, MCP, ℓ_q, capped-ℓ₁, and ℓ₁ penalties plotted over t ∈ [−4, 4]]
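For concreteness, the SCAD (Fan & Li '01) and MCP (Zhang '10) penalties can be written out directly. This is a standard-form sketch; the default γ values are conventional textbook choices, not values from this deck.

```python
import numpy as np

def scad(t, lam, gamma=3.7):
    """SCAD penalty (Fan & Li '01); gamma > 2, with 3.7 their suggested default."""
    a = np.abs(t)
    return np.where(
        a <= lam,
        lam * a,                                               # l1-like near zero
        np.where(
            a <= gamma * lam,
            (2 * gamma * lam * a - a**2 - lam**2) / (2 * (gamma - 1)),  # quadratic blend
            lam**2 * (gamma + 1) / 2,                          # flat beyond gamma*lam
        ),
    )

def mcp(t, lam, gamma=3.0):
    """MCP penalty (Zhang '10); gamma > 1."""
    a = np.abs(t)
    return np.where(a <= gamma * lam, lam * a - a**2 / (2 * gamma), gamma * lam**2 / 2)
```

Both penalties match λ|t| near the origin but flatten out (zero derivative) for |t| ≥ γλ, unlike ℓ₁; this is what removes the bias on large coefficients.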

Page 22

Empirical benefits

Nonconvex regularizers show significant improvement (Breheny & Huang '11).

Page 23

Local vs. global optima

Optimization algorithms are only guaranteed to find local optima (stationary points).

Statistical theory only guarantees consistency of global optima.

[Figure: the global optimum β̂ and a stationary point β̃ both lie within O(√(k log p / n)) of β∗]

L. & Wainwright '13: All stationary points of Lₙ(β) + ρ_λ(β) are close when the nonconvexity is smaller than the curvature.


Page 25

Measures of closeness

Various measures of statistical consistency for

β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n ℓ(β; xᵢ, yᵢ) + ρ_λ(β) }

- Estimation: ‖β̂ − β∗‖ → 0
- Prediction: (1/n) ∑_{i=1}^n ℓ(β̂; xᵢ, yᵢ) → 0
- Variable selection: supp(β̂) → supp(β∗)

Interested in cases where ℓ and ρ_λ are possibly nonconvex.


Page 28

Estimation/prediction consistency

Composite objective function:

β̂ ∈ arg min_{‖β‖₁ ≤ R} { Lₙ(β) + ∑_{j=1}^p ρ_λ(βⱼ) }

- Lₙ satisfies restricted strong convexity with curvature α
- ρ_λ has bounded subgradient at 0, and ρ_λ(t) + μt² is convex

L. & Wainwright '13: All stationary points of Lₙ(β) + ρ_λ(β) are close when α > μ.


Page 31

Geometric intuition

Population-level convexity, finite-sample nonconvexity.

[Figure: function view of Lₙ and gradient view of ∇Lₙ near β∗]

The population-level objective L is strongly convex, with α > μ.

RSC quantifies the convergence rate of ∇Lₙ → ∇L.


Page 33

More formally

[Figure: the global optimum β̂ and a stationary point β̃ both lie within O(√(k log p / n)) of β∗]

Stationary points are statistically indistinguishable from global optima:

⟨∇Lₙ(β̃) + ∇ρ_λ(β̃), β − β̃⟩ ≥ 0,  ∀β feasible

Nonasymptotic rates: For λ ≍ √(log p / n) and R ≍ 1/λ,

‖β̃ − β∗‖₂ ≤ c √(k log p / n) ≈ statistical error


Page 35

Technical conditions

Requirements on the loss and regularizer to ensure consistency of stationary points:

- Restricted strong convexity of Lₙ
- Bound on the nonconvexity of ρ_λ
- "Oracle" result under an additional condition on ρ_λ


Page 38

Conditions on Lₙ

Restricted strong convexity (Negahban et al. '12):

⟨∇Lₙ(β∗ + Δ) − ∇Lₙ(β∗), Δ⟩ ≥
  α‖Δ‖₂² − τ (log p / n) ‖Δ‖₁²,   for all ‖Δ‖₂ ≤ r,
  α‖Δ‖₂ − τ √(log p / n) ‖Δ‖₁,   otherwise.

How is such a result possible?

[Figure: a saddle-shaped loss surface with directions of both positive and negative curvature]

- The loss function has directions of both positive and negative curvature.
- Negative directions are forbidden by the regularizer.
- Holds for various convex/nonconvex losses: OLS and corrected OLS for linear regression, log likelihoods for GLMs, the Huber loss for robust regression.


Page 40

Conditions on ρ_λ

Focus on amenable regularizers ρ_λ(β) = ∑_{j=1}^p ρ_λ(βⱼ) satisfying:

- ρ_λ(0) = 0, symmetric around 0
- Nondecreasing on ℝ⁺
- t ↦ ρ_λ(t)/t nonincreasing on ℝ⁺
- q_λ(t) := λ|t| − ρ_λ(t) differentiable everywhere
- ρ_λ(t) + μt² convex for some μ > 0
- ρ′_λ(t) = 0 for t ≥ γλ, for some γ > 0

Examples:

- μ-amenable: ℓ₁, SCAD, MCP, LSP
- (μ, γ)-amenable: SCAD, MCP
- Neither: capped-ℓ₁, bridge penalty (ℓ_q, 0 < q < 1)
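The convexity condition can be checked numerically for MCP: ρ_λ(t) + μt² with μ = 1/(2γ) equals λ|t| on [−γλ, γλ] and is convex, while ρ_λ alone is not. A sketch of this check (my own, using second differences on a grid as a convexity proxy):

```python
import numpy as np

def mcp(t, lam, gamma=3.0):
    """MCP penalty (Zhang '10)."""
    a = np.abs(t)
    return np.where(a <= gamma * lam, lam * a - a**2 / (2 * gamma), gamma * lam**2 / 2)

lam, gamma = 1.0, 3.0
mu = 1.0 / (2 * gamma)               # mcp(t) + mu * t^2 is convex for this mu
t = np.linspace(-6.0, 6.0, 2001)
rho = mcp(t, lam, gamma)

def second_diff(f):
    """Discrete second differences; nonnegative values indicate convexity."""
    return f[:-2] - 2 * f[1:-1] + f[2:]

# MCP alone has negative curvature (-1/gamma) on 0 < |t| < gamma*lam ...
mcp_nonconvex = second_diff(rho).min() < -1e-8
# ... but adding mu * t^2 restores convexity (up to floating-point rounding).
sum_convex = second_diff(rho + mu * t**2).min() >= -1e-10
```

The flatness condition ρ′_λ(t) = 0 for t ≥ γλ is also visible directly: mcp is constant at γλ²/2 beyond t = γλ.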


Page 47

Statistical consistency

Regularized M-estimator

β̂ ∈ arg min_{‖β‖₁ ≤ R} { Lₙ(β) + ρ_λ(β) },

where the loss function satisfies (α, τ)-RSC and the regularizer is μ-amenable.

Theorem (L. & Wainwright '13)

Suppose R is chosen s.t. β∗ is feasible, and λ satisfies

max{ ‖∇Lₙ(β∗)‖_∞, α √(log p / n) } ≲ λ ≲ α/R.

For n ≥ (Cτ²/α²) R² log p, any stationary point β̃ satisfies

‖β̃ − β∗‖₂ ≲ λ√k / (α − μ),  where k = ‖β∗‖₀.



Page 51

Outline

1. Regularized M-estimators
   - Statistical M-estimation
   - Nonconvexity
   - Consistency of local optima

2. High-dimensional robust regression
   - Statistical consistency
   - Asymptotic normality
   - Two-step M-estimators

Page 52

Robust regression functions

Linear model: yᵢ = xᵢᵀβ∗ + εᵢ

Heavy-tailed distribution on εᵢ and/or outlier contamination.

Use the M-estimator

β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n ℓ(xᵢᵀβ − yᵢ) }


Page 54

Classes of loss functions

Bounded ℓ′ limits the influence of outliers:

IF((x, y); T, F) = lim_{t→0⁺} [T((1 − t)F + tδ_{(x,y)}) − T(F)] / t ∝ ℓ′(xᵀβ − y) x,

where F ∼ F_β and T is the functional defined by the M-estimator.

Redescending M-estimators have a finite rejection point:

ℓ′(u) = 0, for |u| ≥ c

[Figure: least squares, absolute value, Huber, and Tukey losses as functions of the residual (P. Breheny, BST 764: Applied Statistical Modeling)]

But bad for optimization!!
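The distinction between a bounded and a redescending ℓ′ is easy to see in code. A sketch of the two ψ = ℓ′ functions, using the conventional tuning constants (1.345 for Huber, 4.685 for Tukey's biweight — textbook defaults, not values from this deck):

```python
import numpy as np

def huber_psi(u, delta=1.345):
    """Derivative of the Huber loss: bounded, but never returns to zero."""
    return np.clip(u, -delta, delta)

def tukey_psi(u, c=4.685):
    """Derivative of Tukey's biweight loss: redescending, zero for |u| >= c."""
    return np.where(np.abs(u) <= c, u * (1 - (u / c) ** 2) ** 2, 0.0)
```

Huber's ψ saturates on large residuals, so every observation retains bounded influence; Tukey's ψ returns exactly to zero beyond c, so gross outliers are rejected entirely — the finite rejection point above — at the cost of a nonconvex loss.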


Page 57: Robust estimation via (non)convex M-estimation · 2018. 8. 25. · Outline 1 Regularized M-estimators Statistical M-estimation Nonconvexity Consistency of local optima 2 High-dimensional

High-dimensional M-estimators

Natural idea: For p > n, use a regularized version:

β̂ ∈ arg min_β { (1/n) ∑_{i=1}^n `(x_i^T β − y_i) + λ‖β‖_1 }

Complications:

Optimization for nonconvex `?
Statistical theory? Are certain losses provably better than others?

Po-Ling Loh (UW-Madison) (Non)convex M-estimation Aug 15, 2018 27 / 39
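A minimal sketch of this regularized estimator, assuming a convex Huber loss and plain proximal (composite) gradient descent; the function names, step-size rule, and tuning choices are illustrative, not from the talk:

```python
import numpy as np

def huber_grad(u, delta=1.345):
    # Derivative of the Huber loss: bounded by delta
    return np.clip(u, -delta, delta)

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def robust_lasso(X, y, lam, delta=1.345, n_iter=500):
    """Proximal gradient on (1/n) sum huber(x_i^T b - y_i) + lam * ||b||_1.
    Step size 1/L with L = ||X||_op^2 / n, valid since Huber's derivative
    is 1-Lipschitz."""
    n, p = X.shape
    step = n / (np.linalg.norm(X, 2) ** 2)
    beta = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ huber_grad(X @ beta - y, delta) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

# Toy check: sparse signal, heavy-tailed (Cauchy) noise
rng = np.random.default_rng(0)
n, p, k = 200, 50, 3
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:k] = [2.0, -2.0, 1.5]
y = X @ beta_star + rng.standard_cauchy(n)
beta_hat = robust_lasso(X, y, lam=0.2)
err = np.linalg.norm(beta_hat - beta_star)
```

Despite the Cauchy errors having no finite moments, the bounded loss derivative keeps the estimate close to β∗, anticipating the results summarized next.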



Overview of results

When ‖`′‖_∞ < C, global optima of the high-dimensional M-estimator satisfy

‖β̂ − β∗‖_2 ≤ C √(k log p / n),

regardless of the distribution of ε_i.

Compare to Lasso theory: requires sub-Gaussian ε_i's.

If `(u) is locally convex/smooth for |u| ≤ r, any local optima within radius cr of β∗ satisfy

‖β̃ − β∗‖_2 ≤ C′ √(k log p / n)

Local optima may be obtained via a two-step algorithm

Po-Ling Loh (UW-Madison) (Non)convex M-estimation Aug 15, 2018 28 / 39



Theoretical insight

Lasso analysis (e.g., van de Geer '07, Bickel et al. '08):

β̂ ∈ arg min_β { (1/n)‖y − Xβ‖_2^2 + λ‖β‖_1 }, with the objective denoted L_n(β)

Rearranging the basic inequality L_n(β̂) ≤ L_n(β∗) and assuming λ ≥ 2‖X^T ε / n‖_∞, obtain

‖β̂ − β∗‖_2 ≤ cλ√k

Sub-Gaussian assumptions on the x_i's and ε_i's provide O(√(k log p / n)) bounds, minimax optimal

Po-Ling Loh (UW-Madison) (Non)convex M-estimation Aug 15, 2018 29 / 39
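The condition λ ≥ 2‖X^T ε / n‖_∞ can be checked empirically: under sub-Gaussian noise the score concentrates at the √(log p / n) scale that drives the final bound. A small simulation, with all constants illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 500, 2000
X = rng.standard_normal((n, p))
eps = rng.standard_normal(n)  # sub-Gaussian errors

# Empirical score that the regularization level lambda must dominate
score = float(np.abs(X.T @ eps / n).max())

# Sub-Gaussian maximal-inequality scale, up to constants
theory = float(np.sqrt(2 * np.log(p) / n))
print(score, theory)
```

The two numbers are of the same order, so choosing λ proportional to √(log p / n) makes the basic-inequality argument go through.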



Theoretical insight

Key observation: For a general loss function, if λ ≥ 2‖X^T `′(ε) / n‖_∞, obtain

‖β̂ − β∗‖_2 ≤ cλ√k

`′(ε) is sub-Gaussian whenever `′ is bounded

=⇒ can achieve estimation error

‖β̂ − β∗‖_2 ≤ c √(k log p / n),

without assuming ε_i is sub-Gaussian

Po-Ling Loh (UW-Madison) (Non)convex M-estimation Aug 15, 2018 30 / 39
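This key observation can be illustrated directly: for Cauchy errors the raw score ‖X^T ε / n‖_∞ is unstable, while replacing ε by a bounded `′(ε) keeps the score at the usual √(log p / n) scale. A sketch, assuming Huber's ψ as the bounded derivative and illustrative dimensions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 2000
X = rng.standard_normal((n, p))
eps = rng.standard_cauchy(n)            # heavy-tailed errors: no finite moments

psi_eps = np.clip(eps, -1.345, 1.345)   # bounded loss derivative (Huber's psi)

raw = float(np.abs(X.T @ eps / n).max())        # drives lambda for least squares
robust = float(np.abs(X.T @ psi_eps / n).max()) # drives lambda for the robust loss
print(raw, robust)
```

The robust score stays small because `′(ε) is bounded, hence sub-Gaussian, exactly as the slide claims; the raw score is inflated by the largest Cauchy draws.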



Local statistical consistency

Local RSC condition: For ∆ := β_1 − β_2,

⟨∇L_n(β_1) − ∇L_n(β_2), ∆⟩ ≥ α‖∆‖_2^2 − τ (log p / n) ‖∆‖_1^2,  ∀ ‖β_j − β∗‖_2 ≤ r

How is such a result possible?

[Figure: surface plot of a nonconvex loss over [−1, 1]^2, with directions of both positive and negative curvature.]

Loss function has directions of both positive and negative curvature.
Negative directions are forbidden by the regularizer.
Only requires restricted curvature within a constant-radius region around β∗

Po-Ling Loh (UW-Madison) (Non)convex M-estimation Aug 15, 2018 31 / 39



Consistency of local stationary points

[Figure: ball of radius r around β∗ containing the estimates β̂ and β̃, with error of order O(√(k log p / n)).]

Theorem (L. '15)
Suppose L_n satisfies the α-local RSC condition and ρ_λ is µ-amenable, with α > µ. Suppose (λ, R) are chosen appropriately. For n ≳ (τ / (α − µ)) k log p, any stationary point β̃ s.t. ‖β̃ − β∗‖_2 ≤ r satisfies

‖β̃ − β∗‖_2 ≲ λ√k / (α − µ).

Po-Ling Loh (UW-Madison) (Non)convex M-estimation Aug 15, 2018 32 / 39



Second-order considerations

Question: If any bounded-derivative ` works for heavy-tailed distributions, why not always use Huber? LAD?

Answer: It has to do with (asymptotic) efficiency.

In low-dimensional settings, the MLE is maximally efficient with respect to variance.

Although the MLE may behave erratically when p/n → (0, 1], one can achieve simple asymptotic normality results under a sparsity assumption, via the oracle property.

Po-Ling Loh (UW-Madison) (Non)convex M-estimation Aug 15, 2018 33 / 39



Asymptotic efficiency

[Figure: two panels plotting ‖β̂ − β∗‖_2 and the empirical variance of the first component against n/(k log p), for p = 128, 256, 512, comparing Huber and Cauchy losses.]

ℓ_2-error and empirical variance of M-estimators when errors follow a Cauchy distribution (SCAD regularizer)

Po-Ling Loh (UW-Madison) (Non)convex M-estimation Aug 15, 2018 34 / 39


How to obtain local efficient solutions?

Goal: Nonconvex regularized M-estimator such that ` satisfies the α-local RSC condition

Target: Locate a stationary point within radius r of β∗

"Descending ψ-functions are tricky, especially when the starting values for the iterations are non-robust. . . . It is therefore preferable to start with a monotone ψ, iterate to death, and then append a few (1 or 2) iterations with the nonmonotone ψ." — Huber 1981, pp. 191–192

Po-Ling Loh (UW-Madison) (Non)convex M-estimation Aug 15, 2018 35 / 39



Two-step algorithm (L. '15)

Use composite gradient descent starting from a close initialization.

Two-step M-estimator: finds local stationary points of a nonconvex, robust loss + (µ, γ)-amenable penalty.

Algorithm
1. Run composite gradient descent on a convex, robust loss + ℓ_1-penalty until convergence; output β̂_H.
2. Run composite gradient descent on the nonconvex, robust loss + (µ, γ)-amenable penalty, with input β^0 = β̂_H.

Theoretical guarantees on the (rate of) convergence to the optimal point

Po-Ling Loh (UW-Madison) (Non)convex M-estimation Aug 15, 2018 36 / 39
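The two-step scheme can be sketched in a few lines, mirroring Huber's advice: a monotone ψ (Huber) first, then a redescending ψ (Tukey) warm-started at the convex solution. For brevity this sketch keeps the ℓ_1 penalty in step 2 in place of a (µ, γ)-amenable penalty such as SCAD; all names and constants are illustrative:

```python
import numpy as np

def soft_threshold(z, t):
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def huber_psi(u, c=1.345):
    return np.clip(u, -c, c)

def tukey_psi(u, c=4.685):
    return np.where(np.abs(u) <= c, u * (1 - (u / c) ** 2) ** 2, 0.0)

def composite_gd(X, y, psi, lam, beta0, step, n_iter=500):
    # Composite gradient descent on (1/n) sum loss(x_i^T b - y_i) + lam * ||b||_1,
    # where psi = loss'; each step is a gradient move plus soft-thresholding.
    n = X.shape[0]
    beta = beta0.copy()
    for _ in range(n_iter):
        grad = X.T @ psi(X @ beta - y) / n
        beta = soft_threshold(beta - step * grad, step * lam)
    return beta

def two_step(X, y, lam):
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2  # 1/L; both psi's are 1-Lipschitz
    # Step 1: monotone psi (Huber), zero initialization
    beta_h = composite_gd(X, y, huber_psi, lam, np.zeros(p), step)
    # Step 2: redescending psi (Tukey), warm-started at the Huber solution
    return composite_gd(X, y, tukey_psi, lam, beta_h, step)

# Toy check: sparse signal, Cauchy noise
rng = np.random.default_rng(0)
n, p, k = 200, 50, 3
X = rng.standard_normal((n, p))
beta_star = np.zeros(p)
beta_star[:k] = [2.0, -2.0, 1.5]
y = X @ beta_star + rng.standard_cauchy(n)
beta_tilde = two_step(X, y, lam=0.2)
err = np.linalg.norm(beta_tilde - beta_star)
```

The warm start matters: starting the redescending loss at β̂_H places the iterates inside the local RSC region, which is exactly what the theory requires of a "close initialization".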



Summary of two-step M-estimator

Output is computationally and statistically efficient:

Computational guarantee on the rate of convergence in each step of the M-estimation algorithm
Statistical guarantee on the asymptotic efficiency of the estimator (assuming a β-min condition and (µ, γ)-amenability)

Po-Ling Loh (UW-Madison) (Non)convex M-estimation Aug 15, 2018 37 / 39



Summary

Theory for nonconvex regularized M-estimators:

Global RSC condition =⇒ all stationary points within statistical error of β∗
Local RSC condition =⇒ stationary points in a local region within statistical error of β∗

Consequences for high-dimensional robust regression estimators:

Consistency under relaxed distributional assumptions when `′ is bounded
Oracle estimator with (µ, γ)-amenable regularizer =⇒ asymptotic efficiency
Two-step M-estimator produces local oracle solutions

Po-Ling Loh (UW-Madison) (Non)convex M-estimation Aug 15, 2018 38 / 39



References

P. Loh and M. J. Wainwright (2015). Regularized M-estimators with nonconvexity: Statistical and algorithmic theory for local optima. Journal of Machine Learning Research.

P. Loh (2017). Statistical consistency and asymptotic normality for high-dimensional robust M-estimators. Annals of Statistics.

Thank you!

Po-Ling Loh (UW-Madison) (Non)convex M-estimation Aug 15, 2018 39 / 39