1. Doubly Accelerated Stochastic Variance Reduced Gradient Methods for Regularized Empirical Risk Minimization
Tomoya Murata†, Taiji Suzuki‡§¶
†NTT DATA Mathematical Systems Inc., ‡The University of Tokyo, §RIKEN, ¶PRESTO
Jan. 13, 2018
2. This Presentation
Murata and Suzuki:
Doubly Accelerated Stochastic Variance Reduced Dual Averaging
Method for Regularized Empirical Risk Minimization, NIPS 2017
+ some extensions
3. Overview
What:
New methods for solving convex composite optimization in mini-batch settings
Main result:
Improvement of the mini-batch efficiency of previous methods
− Mini-batch efficiency: we say that A is more mini-batch efficient than B if the mini-batch size A needs to achieve a given iteration complexity is smaller than B's
− Iteration complexity: the number of parameter updates needed to achieve a desired optimization error
5. Smoothness
Definition:
We say that f : R^d → R is (L, ℓ)-smooth (L > 0) if, for all x, y ∈ R^d,
    −(ℓ/2)∥x − y∥² ≤ f(x) − f(y) − ⟨∇f(y), x − y⟩ ≤ (L/2)∥x − y∥².
− Lower smoothness ℓ ≤ 0 implies convexity of f (strong convexity if ℓ < 0)
− Lower smoothness ℓ > 0 allows non-convexity of f
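A worked instance of the definition (an illustrative addition, not from the original slides): for a quadratic the second-order expansion is exact, which pins down both constants.

```latex
% Quadratic example: f(x) = (1/2) x^T A x with A symmetric.
% Here f(x) - f(y) - <∇f(y), x - y> equals exactly (1/2)(x-y)^T A (x-y), so
\[
  \frac{\lambda_{\min}(A)}{2}\,\lVert x-y\rVert^{2}
  \;\le\; f(x)-f(y)-\langle \nabla f(y),\,x-y\rangle
  \;\le\; \frac{\lambda_{\max}(A)}{2}\,\lVert x-y\rVert^{2},
\]
% i.e., f is (λ_max(A), −λ_min(A))-smooth: ℓ = −λ_min(A) ≤ 0 iff A ⪰ 0 (convex),
% and ℓ > 0 iff A has a negative eigenvalue (non-convex).
```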
6. Convex Composite Optimization
Focus of this presentation:
    min_{x∈R^d} { P(x) := F(x) + R(x) := (1/n) Σ_{i=1}^n f_i(x) + R(x) }
F: (L, −µ)-smooth (L > 0, µ > 0) (i.e., µ-strongly convex)
f_i: (L, ℓ)-smooth (L > 0, ℓ ≥ 0) (i.e., generally non-convex)
R: simple and (possibly) non-differentiable convex
7. Examples (ℓ = 0)
(a_1, b_1), . . . , (a_n, b_n) ∈ R^d × R: training set.
Lasso:
    f_i(x) := (1/2)(a_i^⊤ x − b_i)², R(x) := λ∥x∥₁
Elastic Net logistic regression:
    f_i(x) := log(1 + exp(−b_i a_i^⊤ x)) + (λ₂/2)∥x∥₂², R(x) := λ₁∥x∥₁
Support vector machines:
    f_i(x) := h̄^ν_i(a_i^⊤ x) + (λ/2)∥x∥₂², R(x) := 0
− h̄^ν_i: smoothed variant of the hinge loss h_i(u) := max{0, 1 − b_i u}
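Both the Lasso and Elastic Net examples use R(x) = λ∥x∥₁, whose proximal operator (the building block of all prox-based methods below) is coordinate-wise soft-thresholding. A minimal NumPy sketch, an illustrative addition rather than the authors' code:

```python
import numpy as np

def prox_l1(x, t):
    """Proximal operator of t*||.||_1 (soft-thresholding):
    prox_{t||.||_1}(x)_j = sign(x_j) * max(|x_j| - t, 0)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

# One proximal gradient step for the Lasso objective:
# F(x) = (1/2n)||Ax - b||^2, R(x) = lam * ||x||_1.
def lasso_prox_grad_step(x, A, b, lam, eta):
    grad = A.T @ (A @ x - b) / A.shape[0]   # ∇F(x)
    return prox_l1(x - eta * grad, eta * lam)
```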
8. Examples (ℓ > 0)
Recently, Carmon et al. (2016), Allen-Zhu and Li (2017) and Yu et al. (2017) have proposed algorithms for finding second-order stationary points of smooth non-convex objectives.
− x is an (ε, δ)-second-order stationary point of f :⇔ ∥∇f(x)∥₂ ≤ ε and ∇²f(x) ⪰ −δI
These algorithms are essentially based on two building blocks:
  finding a first-order stationary point
  finding a direction of negative curvature of the objective
To exploit negative curvature, these algorithms compute the minimum eigenvector of the Hessian.
http://bair.berkeley.edu/blog/2017/08/31/saddle-efficiency/
9. Fast eigenvector computation
Recently, Garber et al. (2016) proposed a novel method for finding approximate eigenvectors via convex optimization.
Essential subproblem:
    min_{z∈R^d} { g(z) := (1/n) Σ_{i=1}^n g_i(z), where g_i(z) := (1/2) z^⊤(λI + ∇²f_i(x0))z − ⟨y, z⟩ }
− λ > −λ_min(∇²F(x0)) is assumed, so that λI + ∇²F(x0) ≻ 0
− z* = (λI + ∇²F(x0))^{−1} y
g is (λ + λ_max(∇²F(x0)), −(λ + λ_min(∇²F(x0))))-smooth
g_i is (λ + λ_max(∇²f_i(x0)), −(λ + λ_min(∇²f_i(x0))))-smooth
Note that generally −(λ + λ_min(∇²f_i(x0))) > 0 (each g_i may be non-convex), even though −(λ + λ_min(∇²F(x0))) < 0 (g is strongly convex).
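A dense-matrix sketch of the resulting shift-and-invert power iteration. Here the linear system is solved exactly with NumPy for illustration, whereas Garber et al. (2016) solve it only inexactly through the finite-sum subproblem g above; the function name and tolerances are our own assumptions:

```python
import numpy as np

def min_eigvec_shift_invert(H, lam, iters=50, seed=0):
    """Approximate the minimum eigenvector of a symmetric H by power
    iteration on (lam*I + H)^{-1}; requires lam > -lambda_min(H) so the
    shifted matrix is positive definite. Each solve corresponds to the
    subproblem min_z g(z) with y set to the current iterate."""
    rng = np.random.default_rng(seed)
    d = H.shape[0]
    M = lam * np.eye(d) + H
    z = rng.standard_normal(d)
    z /= np.linalg.norm(z)
    for _ in range(iters):
        z = np.linalg.solve(M, z)   # exact solve; the paper solves this inexactly
        z /= np.linalg.norm(z)
    return z  # aligns with the eigenvector for lambda_min(H)
```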
10. Outline
1 Problem Setup
2 Previous Work
3 Proposed methods
4 Numerical Experiments
5 Summary
11. Relationships between Previous Work
[Diagram: GD branches to SGD (Randomization) and AGD (Outer Acceleration); SGD leads to SVRG (Variance Reduction); SVRG leads to AccProxSVRG (Inner Acceleration) and Katyusha (Katyusha momentum); SVRG combined with the Universal Catalyst (an Inexact PPA / Inexact APPA scheme) gives UC + SVRG; these threads combine in This Work. Today's focus is marked.]
12. Relationships between Previous Work
[The diagram from slide 11, repeated as a section transition.]
13. SVRG [Johnson and Zhang (2013); Xiao and Zhang (2014)]
(Proximal) Stochastic Variance Reduced Gradient = SGD + Variance Reduction
SVRG(x0, η, m, b, S)
  Iterating the following for s = 1, 2, . . . , S:
    x_s = One Stage SVRG(x_{s−1}, η, m, b).
  Output: x_S.
One Stage SVRG(x0, η, m, b)
  Iterating the following for k = 1, 2, . . . , m:
    Pick I_k ⊂ {1, 2, . . . , n} with size b uniformly.
    v_k = (1/b) Σ_{i∈I_k} (∇f_i(x_{k−1}) − ∇f_i(x0)) + ∇F(x0).
    x_k = prox_{ηR}(x_{k−1} − η v_k).
  Output: (1/m) Σ_{k=1}^m x_k.
14. v_k = ∇f_{I_k}(x_{k−1}) − ∇f_{I_k}(x0) + ∇F(x0)
Main Idea: Use v_k as an unbiased estimator of ∇F(x_{k−1})
− V[v_k] → 0 as x_{k−1}, x0 → x*
− Computational cost per inner iteration is the same as SGD's
[Figure: x0 (initial), x_{k−1} (current), x_k (next), with the gradients ∇F(x0), ∇f_{I_k}(x0), ∇F(x_{k−1}), ∇f_{I_k}(x_{k−1}) combining into the estimator v_k.]
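A minimal NumPy sketch of One Stage SVRG under the notation above; `grad_f(i, x)` (gradient of f_i at x) and `prox_R(x, t)` (prox of tR at x) are assumed helper callables, not part of the original:

```python
import numpy as np

def one_stage_svrg(x0, grad_f, prox_R, n, eta, m, b, rng):
    """One Stage SVRG: variance-reduced proximal SGD around snapshot x0."""
    full_grad = np.mean([grad_f(i, x0) for i in range(n)], axis=0)  # grad F(x0)
    x = x0.copy()
    x_sum = np.zeros_like(x0)
    for _ in range(m):
        I = rng.choice(n, size=b, replace=False)  # mini-batch I_k
        # variance-reduced estimator v_k of grad F(x_{k-1})
        v = np.mean([grad_f(i, x) - grad_f(i, x0) for i in I], axis=0) + full_grad
        x = prox_R(x - eta * v, eta)
        x_sum += x
    return x_sum / m  # stage output: average of the inner iterates
```

Running this stage S times, feeding each output back in as the next snapshot, reproduces the outer loop on the previous slide.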
15. Comparisons of Iteration Complexities:

        ℓ = 0                           ℓ ≥ 0
SGD     O(L/ε + 1/(bµε))                O(L/ε + 1/(bµε))
SVRG    O((n/b + L/µ) log(1/ε))         O((n/b + L/µ + Lℓ/(bµ²)) log(1/ε))

n: training set size, L: upper smoothness of f_i, ℓ: lower smoothness of f_i,
b: mini-batch size, ε: optimization error
− Linear convergence
− Limit in mini-batch settings: SVRG requires at least O((L/µ) log(1/ε)) iterations for any mini-batch size b
Questions:
  Can the mini-batch efficiency of SVRG be improved?
  Can SVRG be accelerated by Nesterov's method?
16. Relationships between Previous Work
[The diagram from slide 11, repeated as a section transition.]
17. AccProxSVRG [Nitanda (2014)]
Accelerated Proximal SVRG = SVRG + Inner Acceleration
AccProxSVRG(x0, η, β, m, b, S)
  Iterating the following for s = 1, 2, . . . , S:
    x_s = One Stage AccProxSVRG(x_{s−1}, η, β, m, b).
  Output: x_S.
One Stage AccProxSVRG(x0, η, β, m, b)
  Iterating the following for k = 1, 2, . . . , m:
    Pick I_k ⊂ {1, 2, . . . , n} with size b uniformly.
    y_k = x_{k−1} + β(x_{k−1} − x_{k−2}).
    v_k = (1/b) Σ_{i∈I_k} (∇f_i(y_k) − ∇f_i(x0)) + ∇F(x0).
    x_k = prox_{ηR}(y_k − η v_k).
  Output: x_m.
18. y_k = x_{k−1} + β(x_{k−1} − x_{k−2})
Main Idea: Use Nesterov's momentum in each inner iteration
[Figure: the momentum step extrapolates from x_{k−2} (previous) through x_{k−1} (current) to y_k, from which the proximal gradient step produces x_k (next).]
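Relative to One Stage SVRG, the only change is the extrapolation step y_k evaluated before the variance-reduced gradient; a sketch reusing the same assumed `grad_f`/`prox_R` helpers as in the SVRG sketch above:

```python
import numpy as np

def one_stage_acc_prox_svrg(x0, grad_f, prox_R, n, eta, beta, m, b, rng):
    """One Stage AccProxSVRG: the SVRG inner loop with Nesterov momentum."""
    full_grad = np.mean([grad_f(i, x0) for i in range(n)], axis=0)  # grad F(x0)
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(m):
        I = rng.choice(n, size=b, replace=False)
        y = x + beta * (x - x_prev)  # momentum: extrapolate past the current iterate
        v = np.mean([grad_f(i, y) - grad_f(i, x0) for i in I], axis=0) + full_grad
        x_prev, x = x, prox_R(y - eta * v, eta)
    return x  # note: the last iterate x_m, not an average
```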
19. Comparisons of Iteration Complexities:

              ℓ = 0                                    ℓ ≥ 0
SVRG          O((n/b + L/µ) log(1/ε))                  O((n/b + L/µ + Lℓ/(bµ²)) log(1/ε))
AccProxSVRG   O((n/b + L/(bµ) + √(L/µ)) log(1/ε))      No analysis

n: training set size, L: upper smoothness of f_i, ℓ: lower smoothness of f_i,
b: mini-batch size, ε: optimization error
− Linear speed-up w.r.t. mini-batch size b: L/µ (SVRG) → L/(bµ) (AccProxSVRG)
− No acceleration in non-mini-batch settings: the rate of AccProxSVRG is the same as that of SVRG when b = 1
Question:
  Can the identical rates of AccProxSVRG and SVRG in non-mini-batch settings be improved?
20. Relationships between Previous Work
[The diagram from slide 11, repeated as a section transition.]
21. Universal Catalyst [Lin et al. (2015)]
Universal Catalyst: a generic acceleration framework
Given a non-accelerated algorithm M (for example, SVRG):
UC(x̌0, κ, {β_t}, {ε_t}, T)
  Iterating the following for t = 1, 2, . . . , T:
    y̌_t = x̌_{t−1} + β_t(x̌_{t−1} − x̌_{t−2}).
    Define G_t(x) = P(x) + (κ/2)∥x − y̌_t∥₂².
    x̌_t ≈ argmin_{x∈R^d} G_t(x), accurately enough that G_t(x̌_t) − G_t* ≤ ε_t, computed by M.
  Output: x̌_T.
Main Idea: Running Inexact Accelerated PPA and solving each subproblem by M
− UC can be regarded as an application of the Inexact Accelerated PPA (PPA: Proximal Point Algorithm).
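A schematic sketch of the UC outer loop; `solve_to_tol` stands for the inner method M run until the stated subproblem accuracy and is an assumed placeholder, as are the parameter sequences:

```python
import numpy as np

def universal_catalyst(x0, P, solve_to_tol, kappa, betas, eps_list, T):
    """UC outer loop: inexact accelerated proximal point around P.

    solve_to_tol(G, y, eps): run the inner method M on the subproblem
    G(x) = P(x) + (kappa/2)||x - y||^2 until G(x) - min G <= eps.
    """
    x_prev, x = x0.copy(), x0.copy()
    for t in range(T):
        y = x + betas[t] * (x - x_prev)  # outer (Nesterov-style) extrapolation
        G = lambda z, y=y: P(z) + 0.5 * kappa * np.dot(z - y, z - y)
        x_prev, x = x, solve_to_tol(G, y, eps_list[t])
    return x
```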
22. Comparisons of Iteration Complexities:

           ℓ = 0                                ℓ ≥ 0
SVRG       O((n/b + L/µ) log(1/ε))              O((n/b + L/µ + Lℓ/(bµ²)) log(1/ε))
UC+SVRG    O((n/b + √(nL/(bµ))) log(1/ε))       O((n/b + √(nL/(bµ)) + (n^{3/4}/b)√((Lℓ)^{1/2}/µ)) log(1/ε))

n: training set size, L: upper smoothness of f_i, ℓ: lower smoothness of f_i,
b: mini-batch size, ε: optimization error; O hides extra log-factors
− Accelerated rate: L/µ (SVRG) → √(nL/(bµ)) (UC+SVRG)
− Only a sublinear speed-up w.r.t. mini-batch size b: not sufficient
− Katyusha also achieves the same rate
Practicality:
  Hard to tune the stopping criteria of the subproblems
  Many tuning parameters
Question:
  Can the dependency on mini-batch size b be improved?
23. Outline
1 Problem Setup
2 Previous Work
3 Proposed methods
4 Numerical Experiments
5 Summary
24. Core Ideas
Double acceleration:
Combining Inner Acceleration and Outer Acceleration
Two approaches:
Applying UC to AccProxSVRG
Directly applying Nesterov’s acceleration to the outer iterations
of AccProxSVRG
The latter algorithm is more direct and practical.
25. Proposed Algorithm
Doubly Accelerated Stochastic Variance Reduced Dual Averaging
= (SVRDA + Inner Acceleration) + Outer Acceleration
DASVRDA^sc(x̌0, η, m, b, S, T)
  Iterating the following for t = 1, 2, . . . , T:
    x̌_t = DASVRDA^ns(x̌_{t−1}, η, m, b, S).
  Output: x̌_T.
DASVRDA^ns(x0, η, m, b, S)
  Iterating the following for s = 1, 2, . . . , S:
    y_s = x_{s−1} + ((s−1)/(s+2))(x_{s−1} − x_{s−2}) + ((s+1)/(s+2))(z_{s−1} − x_{s−1}).
    (x_s, z_s) = One Stage AccSVRDA(y_s, x_{s−1}, η, m, b).
  Output: x_S.
26. One Stage AccSVRDA(x0, x̃, η, m, b)
  Iterating the following for k = 1, 2, . . . , m:
    Pick I_k ⊂ {1, 2, . . . , n} with size b uniformly.
    y_k = x_{k−1} + ((k−1)/(k+1))(x_{k−1} − x_{k−2}).
    v_k = (1/b) Σ_{i∈I_k} (∇f_i(y_k) − ∇f_i(x̃)) + ∇F(x̃).
    v̄_k = (1 − 2/(k+1)) v̄_{k−1} + (2/(k+1)) v_k.
    z_k = prox_{(ηk(k+1)/4)R}(x0 − (ηk(k+1)/4) v̄_k).
    x_k = (1 − 2/(k+1)) x_{k−1} + (2/(k+1)) z_k.
  Output: (x_m, z_m).
Main Idea: Combining Inner Acceleration and Outer Acceleration
− For outer acceleration, a new momentum term ((s+1)/(s+2))(z_{s−1} − x_{s−1}) is added
− AccSVRDA = AccSDA [Xiao (2009)] + Variance Reduction
Why SVRDA rather than SVRG?
− Only because lazy updates can be constructed for AccSVRDA.
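Putting the stage together, a sketch of One Stage AccSVRDA following the update rules above, with the same assumed `grad_f`/`prox_R` helpers as in the earlier sketches; this is not the authors' implementation, which additionally uses lazy updates:

```python
import numpy as np

def one_stage_acc_svrda(x0, x_ref, grad_f, prox_R, n, eta, m, b, rng):
    """One Stage AccSVRDA: accelerated dual averaging with variance reduction.

    x0: stage starting point (y_s above); x_ref: snapshot point (x_{s-1} above).
    """
    full_grad = np.mean([grad_f(i, x_ref) for i in range(n)], axis=0)  # grad F(x_ref)
    x_prev = x = x0.copy()
    v_bar = np.zeros_like(x0)
    z = x0.copy()
    for k in range(1, m + 1):
        I = rng.choice(n, size=b, replace=False)
        y = x + (k - 1) / (k + 1) * (x - x_prev)  # inner momentum
        v = np.mean([grad_f(i, y) - grad_f(i, x_ref) for i in I], axis=0) + full_grad
        v_bar = (1 - 2 / (k + 1)) * v_bar + 2 / (k + 1) * v  # weighted dual average
        t = eta * k * (k + 1) / 4
        z = prox_R(x0 - t * v_bar, t)  # dual-averaging step anchored at x0
        x_prev, x = x, (1 - 2 / (k + 1)) * x + 2 / (k + 1) * z
    return x, z
```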
28. Convergence Analysis (ℓ = 0)
Theorem (ℓ = 0)
Assume that F is (L, −µ)-smooth and each f_i is (L, 0)-smooth. If we appropriately choose η = O(1/((1 + n/b²)L)), S = O(1 + (b/n)√(L/µ) + √(L/(nµ))), and T = O(log(1/ε)), then DASVRDA^sc achieves the iteration complexity
    O((n/b + (1/b)√(nL/µ) + √(L/µ)) log(1/ε))
for E[P(x̌_T) − P(x*)] ≤ ε.
− In contrast, AccProxSVRG: O((n/b + L/(bµ) + √(L/µ)) log(1/ε)),
  and UC + SVRG: O((n/b + √(nL/(bµ))) log(1/ε)).
29. Extension to ℓ ≥ 0
To generalize our results to the case ℓ ≥ 0, we adopt the UC + AccProxSVRG approach.
− For the theoretical guarantee, non-trivial modifications to the AccProxSVRG algorithm are needed.
UC + AccProxSVRG achieves
    O((n/b + (1/b)√(nL/µ) + (n^{3/4}/b)√((Lℓ)^{1/2}/µ)) log(1/ε)).
− In contrast, UC + SVRG only achieves
    O((n/b + √(nL/(bµ)) + (n^{3/4}/b)√((Lℓ)^{1/2}/µ)) log(1/ε)).
30. Outline
1 Problem Setup
2 Previous Work
3 Proposed methods
4 Numerical Experiments
5 Summary
31. Experimental Settings
Model: Elastic Net logistic regression
− Regularization parameters: (λ₁, λ₂) = (10⁻⁴, 10⁻⁶), (0, 10⁻⁶)
− µ = 10⁻⁶, ℓ = 0
Data sets and mini-batch sizes:

Data set   n        d        b
a9a        32,561   123      180
rcv1       20,242   47,236   140
sido0      12,678   4,932    100

Implemented algorithms: SVRG, UC+SVRG, AccProxSVRG, UC+AccProxSVRG, APCG (dual), Katyusha, DASVRDA, and DASVRDA with heuristic adaptive restart
35. Outline
1 Problem Setup
2 Previous Work
3 Proposed methods
4 Numerical Experiments
5 Summary
36. Summary
Conclusion:
New methods for solving convex composite optimization in mini-batch settings
− Improvement of the mini-batch efficiency of previous methods
− Extension to sum-of-non-convex objectives
− Numerically outperforms state-of-the-art methods
37. References I
Allen-Zhu, Z. (2017). Katyusha: The first direct acceleration of stochastic gradient methods. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (STOC 2017).
Allen-Zhu, Z. and Li, Y. (2017). Neon2: Finding local minima via first-order oracles. arXiv preprint arXiv:1711.06673.
Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. (2016). Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756.
Garber, D., Hazan, E., Jin, C., Kakade, S. M., Musco, C., Netrapalli, P., and Sidford, A. (2016). Faster eigenvector computation via shift-and-invert preconditioning. In Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 2626–2634.
38. References II
Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient
descent using predictive variance reduction. In Advances in Neural
Information Processing Systems 26, pages 315–323.
Lin, H., Mairal, J., and Harchaoui, Z. (2015). A universal catalyst for
first-order optimization. In Advances in Neural Information
Processing Systems 28, pages 3384–3392.
Nitanda, A. (2014). Stochastic proximal gradient descent with
acceleration techniques. In Advances in Neural Information
Processing Systems 27, pages 1574–1582.
Xiao, L. (2009). Dual averaging method for regularized stochastic
learning and online optimization. In Advances in Neural
Information Processing Systems 22, pages 2116–2124.
39. References III
Xiao, L. and Zhang, T. (2014). A proximal stochastic gradient
method with progressive variance reduction. SIAM Journal on
Optimization, 24(4), 2057–2075.
Yu, Y., Zou, D., and Gu, Q. (2017). Saving gradient and negative
curvature computations: Finding local minima more efficiently.
arXiv preprint arXiv:1712.03950.