Primal-Dual Block Generalized Frank-Wolfe
Qi Lei∗, Jiacheng Zhuo∗, Constantine Caramanis∗, Inderjit S. Dhillon∗,†
and Alexandros G. Dimakis∗.
∗ University of Texas at Austin, † Amazon
Problem Setup
Convex-concave saddle point problem (with constraints):

    \min_{x \in \mathcal{C} \subset \mathbb{R}^d} \; \max_{y \in \mathbb{R}^n} \; \Big\{ \mathcal{L}(x, y) = f(x) + y^\top A x - g(y) \Big\}

Why is this formulation important?

1. Many machine learning applications
Machine Learning Applications with Convex-Concave Formulations

Empirical Risk Minimization
Reinforcement Learning (Du et al. 2017)
Robust Optimization (Ben-Tal et al. 2009)
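As one concrete instance of why ERM fits this template: with convex losses φ_i, data rows a_iᵀ (stacked into A), and a convex regularizer f, the standard Fenchel-conjugate rewrite gives the bilinear saddle point above. This is a textbook reduction sketched here with assumed notation, not a formula quoted from the slides.

```latex
% ERM with convex losses \phi_i and a convex regularizer f:
\min_{x \in \mathcal{C}} \; \frac{1}{n}\sum_{i=1}^{n} \phi_i(a_i^\top x) + f(x)
% Replace each loss by its Fenchel conjugate,
% \phi_i(u) = \sup_{y_i} \{ y_i u - \phi_i^*(y_i) \}, to expose the bilinear coupling:
\;=\; \min_{x \in \mathcal{C}} \; \max_{y \in \mathbb{R}^n}
\Big\{ f(x) + \tfrac{1}{n}\, y^\top A x - \tfrac{1}{n}\sum_{i=1}^{n} \phi_i^*(y_i) \Big\}
% i.e. the saddle-point form L(x, y) with g(y) = (1/n) \sum_i \phi_i^*(y_i).
```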
Problem Setup
Convex-Concave Saddle Point Problem:
    \min_{x \in \mathcal{C} \subset \mathbb{R}^d} \; \max_{y \in \mathbb{R}^n} \; \Big\{ \mathcal{L}(x, y) = f(x) + y^\top A x - g(y) \Big\}

Why is this formulation important?

1. Many machine learning applications
2. To exploit special structure induced by the constraints
Observations and Challenges on the Frank-Wolfe Algorithm
Lessons from simple constrained minimization problems:

Observations. Frank-Wolfe conducts partial updates (see the sketch below for the ℓ1 case):
1. For an ℓ1-ball constraint, FW conducts a 1-sparse update.
2. For a nuclear-norm-ball constraint, FW conducts a rank-1 update.
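To make the 1-sparse observation concrete, here is a minimal sketch of the Frank-Wolfe linear minimization oracle over the ℓ1 ball (hypothetical helper name, NumPy assumed): the minimizer of a linear function over the ball is a signed, scaled coordinate vector.

```python
import numpy as np

def fw_lmo_l1(grad, tau):
    """Linear minimization oracle over the l1 ball of radius tau.

    Solves argmin_{||v||_1 <= tau} <grad, v>. The minimum is attained at a
    vertex of the ball, i.e. at +/- tau * e_i for the coordinate i with the
    largest |grad_i|, so the Frank-Wolfe direction is 1-sparse.
    """
    i = np.argmax(np.abs(grad))         # coordinate with the largest |gradient|
    v = np.zeros_like(grad)
    v[i] = -tau * np.sign(grad[i])      # move against the sign of the gradient
    return v
```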
Challenges in getting the full benefit from FW and its partial updates:
1. FW yields only sublinear convergence, even for strongly convex problems.
2. Even with partial updates, FW still requires computing the full gradient. (In the big-data setting, the per-iteration complexity is therefore the same as projected gradient descent.)
Tackle challenge 1: To achieve linear convergence
Continue to look at simple minimization problems:
    \min_{x \in \mathbb{R}^d,\ \|x\|_1 \le \tau} \; f(x)

Method          Iteration complexity    # coordinates updated per iteration
Projected GD    κ log(1/ε)              d (feature dimension)
Frank-Wolfe     1/ε                     1
Ours            κ log(1/ε)              s (optimal sparsity)
Tackle challenge 1: block Frank-Wolfe
1: Input: data matrix A ∈ R^{n×d}, label matrix b, iteration count T.
2: Initialize: x_1 ← 0.
3: for t = 1, 2, ..., T − 1 do
4:   Projected GD:  ∆x_t ← argmin_{‖∆x‖_1 ≤ τ} { ⟨∇f(x_t), ∆x⟩ + (β/2η)‖∆x − x_t‖_2² }
     FW:            ∆x_t ← argmin_{‖∆x‖_1 ≤ τ} { ⟨∇f(x_t), ∆x⟩ }
     Ours:          ∆x_t ← argmin_{‖∆x‖_1 ≤ τ, ‖∆x‖_0 ≤ s} { ⟨∇f(x_t), ∆x⟩ + (β/2η)‖∆x − x_t‖_2² }
5:   x_{t+1} ← (1 − η)x_t + η∆x_t
6: end for
7: Output: x_T
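Step 4 of the "Ours" variant is the only nonstandard piece: an s-sparse minimizer of the quadratic model over the ℓ1 ball. The sketch below (hypothetical helper names, NumPy assumed) approximates it by taking the unconstrained gradient step, keeping the s largest-magnitude coordinates, and projecting back onto the ℓ1 ball; this illustrates the shape of the update but is not the paper's exact subproblem solver.

```python
import numpy as np

def l1_projection(v, tau):
    """Euclidean projection of v onto the l1 ball of radius tau (Duchi et al., 2008)."""
    if np.abs(v).sum() <= tau:
        return v
    u = np.sort(np.abs(v))[::-1]                  # sorted magnitudes, descending
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (cssv - tau))[0][-1]
    theta = (cssv[rho] - tau) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def block_fw_direction(x, grad, tau, s, eta, beta):
    """Heuristic s-sparse direction for step 4 ("Ours").

    1. The unconstrained minimizer of <grad, dx> + (beta/(2*eta)) * ||dx - x||^2
       is z = x - (eta/beta) * grad.
    2. Keep only the s largest-magnitude coordinates of z.
    3. Project the s-sparse vector onto the l1 ball to restore feasibility.
    """
    z = x - (eta / beta) * grad
    support = np.argsort(np.abs(z))[-s:]          # indices of the s largest |z_i|
    dx = np.zeros_like(x)
    dx[support] = z[support]
    return l1_projection(dx, tau)
```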
Tackle challenge 2: reduce iteration complexity from partial updates
    \min_{x \in \mathcal{C} \subset \mathbb{R}^d} \; \max_{y \in \mathbb{R}^n} \; \Big\{ \mathcal{L}(x, y) = f(x) + y^\top A x - g(y) \Big\}

Write w = Ax and z = Aᵀy. For each iteration:

Operation                                             Cost
Compute the full gradient ∂_x L = Aᵀy + f′(x)         O(nd)
Conduct BlockFW on x to find an s-sparse update ∆x    O(d)
x⁺ ← (1 − η)x + η∆x                                   O(d)
Greedy block-k coordinate ascent for y                O(nd)
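Why maintaining w = Ax and z = Aᵀy helps is visible directly from the partial gradients of L; this one-line derivation uses the slides' notation (the gradient identities are standard, not quoted from the deck).

```latex
% Partial gradients of L(x, y) = f(x) + y^T A x - g(y):
\nabla_x \mathcal{L}(x, y) = \nabla f(x) + A^\top y = \nabla f(x) + z, \qquad
\nabla_y \mathcal{L}(x, y) = A x - \nabla g(y) = w - \nabla g(y).
% Once w = Ax and z = A^T y are cached, reading either gradient costs only O(d)
% or O(n) extra work, and sparse (resp. block) updates keep w and z cheap to refresh.
```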
Tackle challenge 2: reduce iteration complexity from partial updates
    \min_{x \in \mathcal{C} \subset \mathbb{R}^d} \; \max_{y \in \mathbb{R}^n} \; \Big\{ \mathcal{L}(x, y) = f(x) + y^\top A x - g(y) \Big\}

Maintain w = Ax and z = Aᵀy. For each iteration:

Operation                                             Cost
Compute the full gradient ∂_x L = z + f′(x)           O(d)
Conduct BlockFW on x to find an s-sparse update ∆x    O(d)
x⁺ ← (1 − η)x + η∆x                                   O(d)
w⁺ ← (1 − η)w + ηA∆x                                  O(sn)
Greedy block-k coordinate ascent for y and z          O(kd)
Remark 1: taking k = ns/d, the per-iteration cost is O(sn).
Remark 2: the advantage comes from the fact that the gradients can be maintained through the bilinear form yᵀAx.
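A minimal sketch of that bookkeeping (hypothetical function names, NumPy float arrays assumed; the greedy selection of the dual block is omitted): the primal update touches only s columns of A and the dual block update touches only k rows, which is where the O(sn) and O(kd) entries in the table come from.

```python
import numpy as np

def apply_sparse_primal_update(A, x, w, support, values, eta):
    """x <- (1 - eta) x + eta * dx for an s-sparse dx, keeping w = A x in sync.

    Only the s columns of A indexed by `support` are touched, so the w update
    costs O(s n) rather than O(n d).
    """
    x *= (1.0 - eta)
    w *= (1.0 - eta)
    x[support] += eta * values
    w += eta * (A[:, support] @ values)
    return x, w

def apply_block_dual_update(A, y, z, block, y_new_block):
    """Overwrite the k dual coordinates in `block`, keeping z = A^T y in sync.

    Only the k rows of A indexed by `block` are touched, so the z update
    costs O(k d).
    """
    dy = y_new_block - y[block]
    y[block] = y_new_block
    z += A[block, :].T @ dy
    return y, z
```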
Time complexity comparisons
Algorithm                               Per-iteration cost      Iteration complexity
Frank-Wolfe                             O(nd)                   O(1/ε)
Accelerated PGD (Nesterov et al. 2013)  O(nd)                   O(√κ log(1/ε))
SVRG (Johnson & Zhang 2013)             O(nd)                   O((1 + κ/n) log(1/ε))
SCGS (Lan et al. 2016)                  O(κ²(#iter)³ d / ε²)    O(1/ε)
STORC (Hazan et al. 2016)               O(κ²d + nd)             O(log(1/ε))
Primal-Dual FW (ours)                   O(ns)                   O((1 + κ/n) log(1/ε))
Remark 1: s is the sparsity of the primal optimum induced by the ℓ1 constraint.
Remark 2: for the algorithm and its complexity under nuclear-norm constraints, please refer to our paper for details.
Experiments
Compared methods: (1) Accelerated Projected Gradient Descent (Acc PG), (2) Frank-Wolfe algorithm (FW), (3) Stochastic Variance Reduced Gradient (SVRG), (4) Stochastic Conditional Gradient Sliding (SCGS), and (5) Stochastic Variance-Reduced Conditional Gradient Sliding (STORC).