Primal-Dual Block Generalized Frank-Wolfe
Qi Lei∗, Jiacheng Zhuo∗, Constantine Caramanis∗, Inderjit S. Dhillon∗,†
and Alexandros G. Dimakis∗.
∗ University of Texas at Austin, † Amazon
Problem Setup
Convex-concave saddle point problem (with constraints):

    \min_{x \in \mathcal{C} \subset \mathbb{R}^d} \; \max_{y \in \mathbb{R}^n} \; \Big\{ \mathcal{L}(x, y) = f(x) + y^\top A x - g(y) \Big\}

Why is this formulation important?

1. Many machine learning applications
Machine Learning Applications with Convex-Concave Formulations

Empirical Risk Minimization
Reinforcement Learning (Du et al. 2017)
Robust Optimization (Ben-Tal et al. 2009)
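As one concrete instance of why ERM fits this template: with convex losses φ_i, data rows a_iᵀ (stacked into A), and a convex regularizer f, the standard Fenchel-conjugate rewrite gives the bilinear saddle point above. This is a textbook reduction sketched here with assumed notation, not a formula quoted from the slides.

```latex
% ERM with convex losses \phi_i and a convex regularizer f:
\min_{x \in \mathcal{C}} \; \frac{1}{n}\sum_{i=1}^{n} \phi_i(a_i^\top x) + f(x)
% Replace each loss by its Fenchel conjugate,
% \phi_i(u) = \sup_{y_i} \{ y_i u - \phi_i^*(y_i) \}, to expose the bilinear coupling:
\;=\; \min_{x \in \mathcal{C}} \; \max_{y \in \mathbb{R}^n}
\Big\{ f(x) + \tfrac{1}{n}\, y^\top A x - \tfrac{1}{n}\sum_{i=1}^{n} \phi_i^*(y_i) \Big\}
% i.e. the saddle-point form L(x, y) with g(y) = (1/n) \sum_i \phi_i^*(y_i).
```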
Problem Setup
Convex-Concave Saddle Point Problem:
    \min_{x \in \mathcal{C} \subset \mathbb{R}^d} \; \max_{y \in \mathbb{R}^n} \; \Big\{ \mathcal{L}(x, y) = f(x) + y^\top A x - g(y) \Big\}

Why is this formulation important?

1. Many machine learning applications
2. To exploit special structure induced by the constraints
Observations and Challenges on the Frank-Wolfe Algorithm
Lessons from simple constrained minimization problems:

Observations. Frank-Wolfe conducts partial updates (see the sketch below for the ℓ1 case):
1. For an ℓ1-ball constraint, FW conducts a 1-sparse update.
2. For a nuclear-norm-ball constraint, FW conducts a rank-1 update.
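To make the 1-sparse observation concrete, here is a minimal sketch of the Frank-Wolfe linear minimization oracle over the ℓ1 ball (hypothetical helper name, NumPy assumed): the minimizer of a linear function over the ball is a signed, scaled coordinate vector.

```python
import numpy as np

def fw_lmo_l1(grad, tau):
    """Linear minimization oracle over the l1 ball of radius tau.

    Solves argmin_{||v||_1 <= tau} <grad, v>. The minimum is attained at a
    vertex of the ball, i.e. at +/- tau * e_i for the coordinate i with the
    largest |grad_i|, so the Frank-Wolfe direction is 1-sparse.
    """
    i = np.argmax(np.abs(grad))         # coordinate with the largest |gradient|
    v = np.zeros_like(grad)
    v[i] = -tau * np.sign(grad[i])      # move against the sign of the gradient
    return v
```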
Challenges in getting the full benefit from FW and its partial updates:
1. FW yields only sublinear convergence, even for strongly convex problems.
2. Even with partial updates, FW still requires computing the full gradient. (In the big-data setting, the per-iteration complexity is therefore the same as projected gradient descent.)
Tackle challenge 1: To achieve linear convergence
Continue to look at simple minimization problems:
    \min_{x \in \mathbb{R}^d,\ \|x\|_1 \le \tau} \; f(x)

Method          Iteration complexity    # coordinates updated per iteration
Projected GD    κ log(1/ε)              d (feature dimension)
Frank-Wolfe     1/ε                     1
Ours            κ log(1/ε)              s (optimal sparsity)
Tackle challenge 1: block Frank-Wolfe
1: Input: data matrix A ∈ R^{n×d}, label matrix b, iteration count T.
2: Initialize: x_1 ← 0.
3: for t = 1, 2, ..., T − 1 do
4:   Projected GD:  ∆x_t ← argmin_{‖∆x‖_1 ≤ τ} { ⟨∇f(x_t), ∆x⟩ + (β/2η)‖∆x − x_t‖_2² }
     FW:            ∆x_t ← argmin_{‖∆x‖_1 ≤ τ} { ⟨∇f(x_t), ∆x⟩ }
     Ours:          ∆x_t ← argmin_{‖∆x‖_1 ≤ τ, ‖∆x‖_0 ≤ s} { ⟨∇f(x_t), ∆x⟩ + (β/2η)‖∆x − x_t‖_2² }
5:   x_{t+1} ← (1 − η)x_t + η∆x_t
6: end for
7: Output: x_T
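Step 4 of the "Ours" variant is the only nonstandard piece: an s-sparse minimizer of the quadratic model over the ℓ1 ball. The sketch below (hypothetical helper names, NumPy assumed) approximates it by taking the unconstrained gradient step, keeping the s largest-magnitude coordinates, and projecting back onto the ℓ1 ball; this illustrates the shape of the update but is not the paper's exact subproblem solver.

```python
import numpy as np

def l1_projection(v, tau):
    """Euclidean projection of v onto the l1 ball of radius tau (Duchi et al., 2008)."""
    if np.abs(v).sum() <= tau:
        return v
    u = np.sort(np.abs(v))[::-1]                  # sorted magnitudes, descending
    cssv = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > (cssv - tau))[0][-1]
    theta = (cssv[rho] - tau) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

def block_fw_direction(x, grad, tau, s, eta, beta):
    """Heuristic s-sparse direction for step 4 ("Ours").

    1. The unconstrained minimizer of <grad, dx> + (beta/(2*eta)) * ||dx - x||^2
       is z = x - (eta/beta) * grad.
    2. Keep only the s largest-magnitude coordinates of z.
    3. Project the s-sparse vector onto the l1 ball to restore feasibility.
    """
    z = x - (eta / beta) * grad
    support = np.argsort(np.abs(z))[-s:]          # indices of the s largest |z_i|
    dx = np.zeros_like(x)
    dx[support] = z[support]
    return l1_projection(dx, tau)
```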
Tackle challenge 2: reduce iteration complexity from partial updates
    \min_{x \in \mathcal{C} \subset \mathbb{R}^d} \; \max_{y \in \mathbb{R}^n} \; \Big\{ \mathcal{L}(x, y) = f(x) + y^\top A x - g(y) \Big\}

Write w = Ax and z = Aᵀy. For each iteration:

Operation                                             Cost
Compute the full gradient ∂_x L = Aᵀy + f′(x)         O(nd)
Conduct BlockFW on x to find an s-sparse update ∆x    O(d)
x⁺ ← (1 − η)x + η∆x                                   O(d)
Greedy block-k coordinate ascent for y                O(nd)
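Why maintaining w = Ax and z = Aᵀy helps is visible directly from the partial gradients of L; this one-line derivation uses the slides' notation (the gradient identities are standard, not quoted from the deck).

```latex
% Partial gradients of L(x, y) = f(x) + y^T A x - g(y):
\nabla_x \mathcal{L}(x, y) = \nabla f(x) + A^\top y = \nabla f(x) + z, \qquad
\nabla_y \mathcal{L}(x, y) = A x - \nabla g(y) = w - \nabla g(y).
% Once w = Ax and z = A^T y are cached, reading either gradient costs only O(d)
% or O(n) extra work, and sparse (resp. block) updates keep w and z cheap to refresh.
```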
Tackle challenge 2: reduce iteration complexity from partial updates
    \min_{x \in \mathcal{C} \subset \mathbb{R}^d} \; \max_{y \in \mathbb{R}^n} \; \Big\{ \mathcal{L}(x, y) = f(x) + y^\top A x - g(y) \Big\}

Maintain w = Ax and z = Aᵀy. For each iteration:

Operation                                             Cost
Compute the full gradient ∂_x L = z + f′(x)           O(d)
Conduct BlockFW on x to find an s-sparse update ∆x    O(d)
x⁺ ← (1 − η)x + η∆x                                   O(d)
w⁺ ← (1 − η)w + ηA∆x                                  O(sn)
Greedy block-k coordinate ascent for y and z          O(kd)
Remark 1: taking k = ns/d, the per-iteration cost is O(sn).
Remark 2: the advantage comes from the fact that the gradients can be maintained through the bilinear form yᵀAx.
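A minimal sketch of that bookkeeping (hypothetical function names, NumPy float arrays assumed; the greedy selection of the dual block is omitted): the primal update touches only s columns of A and the dual block update touches only k rows, which is where the O(sn) and O(kd) entries in the table come from.

```python
import numpy as np

def apply_sparse_primal_update(A, x, w, support, values, eta):
    """x <- (1 - eta) x + eta * dx for an s-sparse dx, keeping w = A x in sync.

    Only the s columns of A indexed by `support` are touched, so the w update
    costs O(s n) rather than O(n d).
    """
    x *= (1.0 - eta)
    w *= (1.0 - eta)
    x[support] += eta * values
    w += eta * (A[:, support] @ values)
    return x, w

def apply_block_dual_update(A, y, z, block, y_new_block):
    """Overwrite the k dual coordinates in `block`, keeping z = A^T y in sync.

    Only the k rows of A indexed by `block` are touched, so the z update
    costs O(k d).
    """
    dy = y_new_block - y[block]
    y[block] = y_new_block
    z += A[block, :].T @ dy
    return y, z
```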
Time complexity comparisons
Algorithm                               Per-iteration cost      Iteration complexity
Frank-Wolfe                             O(nd)                   O(1/ε)
Accelerated PGD (Nesterov et al. 2013)  O(nd)                   O(√κ log(1/ε))
SVRG (Johnson & Zhang 2013)             O(nd)                   O((1 + κ/n) log(1/ε))
SCGS (Lan et al. 2016)                  O(κ²(#iter)³ d / ε²)    O(1/ε)
STORC (Hazan et al. 2016)               O(κ²d + nd)             O(log(1/ε))
Primal-Dual FW (ours)                   O(ns)                   O((1 + κ/n) log(1/ε))
Remark 1: s is the sparsity of the primal optimum induced by the ℓ1 constraint.
Remark 2: for the algorithm and its complexity under nuclear-norm constraints, please refer to our paper for details.
Experiments
Compared methods: (1) Accelerated Projected Gradient Descent (Acc PG), (2) Frank-Wolfe algorithm (FW), (3) Stochastic Variance Reduced Gradient (SVRG), (4) Stochastic Conditional Gradient Sliding (SCGS), and (5) Stochastic Variance-Reduced Conditional Gradient Sliding (STORC).