
Page 1: Primal-Dual Block Generalized Frank-Wolfe

Qi Lei∗, Jiacheng Zhuo∗, Constantine Caramanis∗, Inderjit S. Dhillon∗,† and Alexandros G. Dimakis∗

∗ University of Texas at Austin    † Amazon

NeurIPS 2019

Page 2: Problem Setup

Convex-concave saddle-point problem (with constraints):

$$\min_{x \in \mathcal{C} \subset \mathbb{R}^d} \; \max_{y \in \mathbb{R}^n} \; \Big\{\, \mathcal{L}(x, y) = f(x) + y^\top A x - g(y) \,\Big\}$$

Why is this formulation important?

1. Many machine learning applications

Page 3: Machine Learning Applications with Convex-Concave Formulations

[Figure: example applications: Empirical Risk Minimization; Reinforcement Learning (Du et al., 2017); Robust Optimization (Ben-Tal et al., 2009)]

Page 4: Problem Setup

Convex-concave saddle-point problem:

$$\min_{x \in \mathcal{C} \subset \mathbb{R}^d} \; \max_{y \in \mathbb{R}^n} \; \Big\{\, \mathcal{L}(x, y) = f(x) + y^\top A x - g(y) \,\Big\}$$

Why is this formulation important?

1. Many machine learning applications (a concrete instance follows below)
2. To exploit special structure induced by the constraints
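As a concrete instance (a standard Fenchel-duality reduction, stated here for illustration rather than taken verbatim from the slides), constrained empirical risk minimization fits this template: with a data matrix $A \in \mathbb{R}^{n \times d}$ whose rows are $a_i^\top$, smooth convex losses $\ell_i$, and $\mathcal{C}$ for instance the $\ell_1$ ball,

$$\min_{x \in \mathcal{C}} \; f(x) + \sum_{i=1}^n \ell_i(a_i^\top x) \;=\; \min_{x \in \mathcal{C}} \; \max_{y \in \mathbb{R}^n} \; \Big\{\, f(x) + y^\top A x - \sum_{i=1}^n \ell_i^*(y_i) \,\Big\},$$

where $\ell_i^*$ is the convex conjugate of $\ell_i$, so $g(y) = \sum_i \ell_i^*(y_i)$ is separable across dual coordinates.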

Page 5: Observations and Challenges for the Frank-Wolfe Algorithm

Lessons from simple constrained minimization problems:

Observations. Frank-Wolfe conducts partial updates:
1. For an ℓ1-ball constraint, FW conducts a 1-sparse update (see the sketch after this slide).
2. For a nuclear-norm-ball constraint, FW conducts a rank-1 update.

Challenges to getting the full benefit of FW and its partial updates:
1. FW yields only sublinear convergence, even for strongly convex problems.
2. Even with partial updates, FW requires computing the full gradient. (In the big-data setting, the per-iteration complexity is therefore the same as projected gradient descent.)
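A minimal sketch (ours, not from the slides) of why the ℓ1-ball update is 1-sparse: the Frank-Wolfe linear minimization oracle over the ℓ1 ball always returns a vertex, i.e., a signed, scaled coordinate vector.

```python
import numpy as np

def fw_lmo_l1(grad, tau):
    """Frank-Wolfe linear minimization oracle over the l1 ball of radius tau:
    argmin_{||v||_1 <= tau} <grad, v>.
    A linear objective is minimized at a vertex of the ball, so the
    returned direction is 1-sparse."""
    i = np.argmax(np.abs(grad))        # coordinate with the largest |gradient|
    v = np.zeros_like(grad)
    v[i] = -tau * np.sign(grad[i])     # step against the gradient's sign
    return v
```

The nuclear-norm analogue replaces the argmax with the top singular-vector pair of the gradient matrix, giving a rank-1 update.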

Page 6: Tackle challenge 1: achieve linear convergence

Continue with simple minimization problems:

$$\min_{x \in \mathbb{R}^d, \; \|x\|_1 \le \tau} f(x)$$

Method         Iteration complexity   # updates per iteration
Projected GD   κ log(1/ε)             d (feature dimension)
Frank-Wolfe    1/ε                    1
Ours           κ log(1/ε)             s (optimal sparsity)

Page 7: Tackle challenge 1: block Frank-Wolfe

1: Input: data matrix $A \in \mathbb{R}^{n \times d}$, labels $b$, iteration count $T$.
2: Initialize: $x_1 \leftarrow 0$.
3: for $t = 1, 2, \dots, T-1$ do
4:   compute the update direction (the three methods differ only in this step; a sketch of the "ours" case follows below):
     Projected GD: $\Delta x_t \leftarrow \arg\min_{\|\Delta x\|_1 \le \tau} \big\{ \langle \nabla f(x_t), \Delta x \rangle + \tfrac{\beta}{2\eta} \|\Delta x - x_t\|_2^2 \big\}$
     FW: $\Delta x_t \leftarrow \arg\min_{\|\Delta x\|_1 \le \tau} \langle \nabla f(x_t), \Delta x \rangle$
     Ours: $\Delta x_t \leftarrow \arg\min_{\|\Delta x\|_1 \le \tau, \, \|\Delta x\|_0 \le s} \big\{ \langle \nabla f(x_t), \Delta x \rangle + \tfrac{\beta}{2\eta} \|\Delta x - x_t\|_2^2 \big\}$
5:   $x_{t+1} \leftarrow (1 - \eta)\, x_t + \eta\, \Delta x_t$
6: end for
7: Output: $x_T$
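A sketch of the "ours" subproblem (our reconstruction, not the authors' code): completing the square shows step 4 is a Euclidean projection of $v = x_t - \tfrac{\eta}{\beta} \nabla f(x_t)$ onto $\{\Delta x : \|\Delta x\|_1 \le \tau,\ \|\Delta x\|_0 \le s\}$, which can be computed by keeping the top-$s$ coordinates of $v$ and ℓ1-projecting them.

```python
import numpy as np

def project_l1(v, tau):
    """Euclidean projection onto the l1 ball of radius tau
    (the sort-based algorithm of Duchi et al., 2008)."""
    if np.abs(v).sum() <= tau:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                  # magnitudes, descending
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - tau)[0][-1]
    theta = (css[rho] - tau) / (rho + 1.0)        # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def block_fw_direction(x, grad, tau, s, beta, eta):
    """Sketch of step 4 ("ours"): argmin over {||d||_1<=tau, ||d||_0<=s} of
    <grad, d> + (beta/(2*eta)) * ||d - x||_2^2.
    Completing the square reduces this to projecting v below onto the
    joint l1/l0 constraint set."""
    v = x - (eta / beta) * grad
    d = np.zeros_like(v)
    top = np.argsort(np.abs(v))[-s:]    # best s-support: largest magnitudes
    d[top] = project_l1(v[top], tau)    # l1-projection restricted to support
    return d
```

The resulting $\Delta x_t$ has at most $s$ nonzeros, matching the "# updates per iteration = s" entry in the table on Page 6.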


Page 10: Tackle challenge 2: reduce the per-iteration cost of partial updates

$$\min_{x \in \mathcal{C} \subset \mathbb{R}^d} \; \max_{y \in \mathbb{R}^n} \; \Big\{\, \mathcal{L}(x, y) = f(x) + y^\top A x - g(y) \,\Big\}$$

Write w = Ax and z = A⊤y. For each iteration, a naive implementation costs:

Operation                                              Cost
Compute full gradient ∂xL = A⊤y + f′(x)                O(nd)
Conduct BlockFW on x to find an s-sparse update Δx     O(d)
x⁺ ← (1 − η)x + ηΔx                                    O(d)
Greedy block-k coordinate ascent for y                 O(nd)

Maintaining w and z turns the two O(nd) operations into cheap updates, as the next page shows.

Page 11: Tackle challenge 2: reduce the per-iteration cost of partial updates (continued)

Maintain w = Ax and z = A⊤y. For each iteration:

Operation                                              Cost
Compute full gradient ∂xL = z + f′(x)                  O(d)
Conduct BlockFW on x to find an s-sparse update Δx     O(d)
x⁺ ← (1 − η)x + ηΔx                                    O(d)
w⁺ ← (1 − η)w + ηAΔx                                   O(sn)
Greedy block-k coordinate ascent for y and z           O(kd)

Remark 1: taking k = ns/d, the total per-iteration cost is O(sn).
Remark 2: the advantage comes from the fact that the gradient can be maintained through the bilinear form (see the sketch below).
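A sketch of the maintenance step (assumed variable names and a plain gradient step for the dual; not the authors' code): an s-sparse Δx touches only s columns of A, and a k-block dual update touches only k rows, which is where the O(sn) and O(kd) costs come from.

```python
import numpy as np

def primal_step(x, w, dx, A, eta):
    """x+ = (1 - eta)x + eta*dx, maintaining w = Ax.
    Only the s nonzero coordinates of dx multiply columns of A,
    so the w update costs O(s*n) rather than O(n*d)."""
    support = np.nonzero(dx)[0]
    w = (1 - eta) * w + eta * (A[:, support] @ dx[support])
    x = (1 - eta) * x + eta * dx
    return x, w

def dual_step(y, z, w, A, g_prime, k, step):
    """Greedy block-k coordinate ascent on y, maintaining z = A^T y.
    The dual gradient is dL/dy = Ax - g'(y) = w - g'(y); updating k
    coordinates of y changes z through k rows of A, costing O(k*d)."""
    grad_y = w - g_prime(y)                   # uses the maintained w = Ax
    block = np.argsort(np.abs(grad_y))[-k:]   # k largest dual gradients
    dy = step * grad_y[block]
    y = y.copy()
    y[block] += dy
    z = z + A[block, :].T @ dy                # rank-k row update of z
    return y, z
```

Here `g_prime` is an assumed callable returning the coordinate-wise derivative of $g$; the paper's dual update may use an exact coordinate maximization rather than a gradient step.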

Page 12: Time complexity comparisons

Algorithm                           Per-iteration cost     Iteration complexity
Frank-Wolfe                         O(nd)                  O(1/ε)
Accelerated PGD (Nesterov, 2013)    O(nd)                  O(√κ log(1/ε))
SVRG (Johnson & Zhang, 2013)        O(nd)                  O((1 + κ/n) log(1/ε))
SCGS (Lan et al., 2016)             O((κ² #iter³/ε²) d)    O(1/ε)
STORC (Hazan et al., 2016)          O(κ²d + nd)            O(log(1/ε))
Primal-Dual FW (ours)               O(ns)                  O((1 + κ/n) log(1/ε))

Remark 1: s is the sparsity of the primal optimum induced by the ℓ1 constraint.
Remark 2: for the algorithm and its complexity under nuclear-norm constraints, see our paper for details.

Page 13: Experiments

Compared methods: (1) Accelerated Projected Gradient Descent (Acc PG), (2) the Frank-Wolfe algorithm (FW), (3) Stochastic Variance Reduced Gradient (SVRG), (4) Stochastic Conditional Gradient Sliding (SCGS), and (5) Stochastic Variance-Reduced Conditional Gradient Sliding (STORC).