Doubly Accelerated
Stochastic Variance Reduced Gradient Methods
for Regularized Empirical Risk Minimization
Tomoya Murata†, Taiji Suzuki‡§¶
†NTT DATA Mathematical Systems Inc., ‡The University of Tokyo, §RIKEN, ¶PRESTO
Jan. 13, 2018
1 / 39
This Presentation
Murata and Suzuki:
Doubly Accelerated Stochastic Variance Reduced Dual Averaging
Method for Regularized Empirical Risk Minimization, NIPS 2017
+ some extensions
2 / 39
Overview
What:
New methods for solving convex composite optimization in
mini-batch settings
Main result:
Improvement of the mini-batch efficiency of previous methods
− Mini-batch efficiency: We say that A is more mini-batch efficient than B if the mini-batch size A needs to achieve a given iteration complexity is smaller than the one B needs.
− Iteration complexity: The number of parameter updates needed to achieve a desired optimization error.
3 / 39
Outline
1 Problem Setup
2 Previous Work
3 Proposed methods
4 Numerical Experiments
5 Summary
4 / 39
Smoothness
Definition:
We say that f : ℝ^d → ℝ is (L, ℓ)-smooth (L > 0) if
$$-\frac{\ell}{2}\|x - y\|^2 \;\le\; f(x) - f(y) - \langle \nabla f(y),\, x - y \rangle \;\le\; \frac{L}{2}\|x - y\|^2.$$
− Lower smoothness ℓ ≤ 0 implies (strong) convexity of f
− Lower smoothness ℓ > 0 allows f to be non-convex
5 / 39
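As a sanity check on the definition, a quadratic f(x) = ½xᵀAx with symmetric A is (λmax(A), −λmin(A))-smooth, since f(x) − f(y) − ⟨∇f(y), x − y⟩ = ½(x − y)ᵀA(x − y). The following numpy sketch (ours, not from the talk) verifies the two inequalities on random points:

```python
import numpy as np

# Verify (L, l)-smoothness for f(x) = 0.5 x^T A x with symmetric A:
# L = lambda_max(A), l = -lambda_min(A).
rng = np.random.default_rng(0)
d = 5
B = rng.standard_normal((d, d))
A = (B + B.T) / 2                    # symmetric, generally indefinite
eigs = np.linalg.eigvalsh(A)         # eigenvalues in ascending order
L, l = eigs[-1], -eigs[0]            # upper / lower smoothness

f = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

for _ in range(1000):
    x, y = rng.standard_normal(d), rng.standard_normal(d)
    gap = f(x) - f(y) - grad(y) @ (x - y)
    r2 = (x - y) @ (x - y)
    assert -l / 2 * r2 - 1e-9 <= gap <= L / 2 * r2 + 1e-9

print(f"(L, l) = ({L:.3f}, {l:.3f})")  # l > 0 iff A has a negative eigenvalue
```

Here l > 0 exactly when A has a negative eigenvalue, matching the remark that positive lower smoothness corresponds to non-convexity.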
Convex Composite Optimization
Focus of this presentation:
$$\min_{x \in \mathbb{R}^d} \Big\{\, P(x) \overset{\mathrm{def}}{=} F(x) + R(x) \overset{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^n f_i(x) + R(x) \,\Big\}$$
F: (L, −µ)-smooth (L > 0, µ > 0) (i.e., µ-strongly convex)
fi: (L, ℓ)-smooth (L > 0, ℓ ≥ 0) (i.e., generally nonconvex)
R: simple and (possibly) non-differentiable convex
6 / 39
Examples (ℓ = 0)
(a₁, b₁), . . . , (aₙ, bₙ) ∈ ℝ^d × ℝ: training set.
Lasso:
$$f_i(x) \overset{\mathrm{def}}{=} \frac{1}{2}(a_i^\top x - b_i)^2, \quad R(x) \overset{\mathrm{def}}{=} \lambda \|x\|_1$$
Elastic Net logistic regression:
$$f_i(x) \overset{\mathrm{def}}{=} \log(1 + \exp(-b_i a_i^\top x)) + \frac{\lambda_2}{2}\|x\|_2^2, \quad R(x) \overset{\mathrm{def}}{=} \lambda_1 \|x\|_1$$
Support vector machines:
$$f_i(x) \overset{\mathrm{def}}{=} \bar h_i^{\nu}(a_i^\top x) + \frac{\lambda}{2}\|x\|_2^2, \quad R(x) \overset{\mathrm{def}}{=} 0$$
− $\bar h_i^{\nu}$: smoothed variant of the hinge loss $h_i(u) \overset{\mathrm{def}}{=} \max\{0, 1 - b_i u\}$
7 / 39
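To make the composite structure concrete, here is a short Python sketch (ours; all names are illustrative) of the Elastic Net logistic pieces. The prox of R(x) = λ1∥x∥₁ is the standard coordinate-wise soft-thresholding operator:

```python
import numpy as np

def f_i(x, a_i, b_i, lam2):
    """Smooth part: logistic loss plus L2 term."""
    return np.log1p(np.exp(-b_i * (a_i @ x))) + 0.5 * lam2 * (x @ x)

def grad_f_i(x, a_i, b_i, lam2):
    """Gradient of f_i: -b_i * sigmoid(-b_i a_i^T x) * a_i + lam2 * x."""
    s = 1.0 / (1.0 + np.exp(b_i * (a_i @ x)))
    return -b_i * s * a_i + lam2 * x

def prox_R(x, c, lam1):
    """prox_{c * lam1 ||.||_1}(x): coordinate-wise soft thresholding."""
    return np.sign(x) * np.maximum(np.abs(x) - c * lam1, 0.0)
```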
Examples (ℓ > 0)
Recently, Carmon et al. (2016), Allen-Zhu and Li (2017), and Yu et al. (2017) have proposed algorithms for finding second-order stationary points of smooth non-convex objectives.
− x is an (ε, δ)-second-order stationary point of f
$$\overset{\mathrm{def}}{\Leftrightarrow} \quad \|\nabla f(x)\|_2 \le \varepsilon \ \text{ and } \ \nabla^2 f(x) \succeq -\delta I$$
These algorithms are essentially based on two building blocks:
finding a first-order stationary point
finding a direction of the objective that has negative curvature
To exploit negative curvature, these algorithms compute the minimum eigenvector of the Hessian.
http://bair.berkeley.edu/blog/2017/08/31/saddle-efficiency/
8 / 39
Fast eigenvector computation:
Recently, Garber et al. (2016) have proposed a novel method for finding approximate eigenvectors using convex optimization.
Essential subproblem:
$$\min_{z \in \mathbb{R}^d} \Big\{\, g(z) \overset{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^n g_i(z) \overset{\mathrm{def}}{=} \frac{1}{n}\sum_{i=1}^n \frac{1}{2} z^\top (\lambda I + \nabla^2 f_i(x_0)) z - \langle y, z \rangle \,\Big\}$$
− λ > λmin(∇²F(x₀)) is assumed
− z* = (λI + ∇²F(x₀))⁻¹ y
g is (λ + λmax(∇²F(x₀)), −(λ − λmin(∇²F(x₀))))-smooth
gᵢ is (λ + λmax(∇²fᵢ(x₀)), −(λ − λmin(∇²fᵢ(x₀))))-smooth
Note that generally −(λ − λmin(∇²fᵢ(x₀))) > 0, even though −(λ − λmin(∇²F(x₀))) < 0.
9 / 39
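A quick numerical illustration of the subproblem (our sketch; a random symmetric matrix stands in for the Hessian): the unique minimizer of the strongly convex g is exactly the shift-and-invert solution z* = (λI + ∇²F(x₀))⁻¹y.

```python
import numpy as np

# The subproblem's minimizer is z* = (lam*I + H)^{-1} y, where H plays
# the role of the Hessian of F at x0 (a random symmetric stand-in here).
rng = np.random.default_rng(1)
d = 6
B = rng.standard_normal((d, d))
H = (B + B.T) / 2
lam = 1.0 + abs(np.linalg.eigvalsh(H)[0])   # shift making lam*I + H > 0
y = rng.standard_normal(d)

g = lambda z: 0.5 * z @ ((lam * np.eye(d) + H) @ z) - y @ z
z_star = np.linalg.solve(lam * np.eye(d) + H, y)

# g(z*) is no larger than g at nearby random points:
for _ in range(100):
    assert g(z_star) <= g(z_star + 0.1 * rng.standard_normal(d)) + 1e-12
```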
Relationships between Previous Work
[Diagram: GD, SGD, AGD, SVRG, Katyusha, AccProxSVRG, UC + SVRG, Inexact PPA, Inexact APPA, and This Work, connected by the techniques Randomization, Variance Reduction, Outer Acceleration, Inner Acceleration, Universal Catalyst, and Katyusha momentum; "Today's focus" marks the paths discussed next.]
11 / 39
SVRG [Johnson and Zhang (2013); Xiao and Zhang (2014)]
(Proximal) Stochastic Variance Reduced Gradient
= SGD + Variance Reduction

SVRG(x_0, η, m, b, S):
  Iterate for s = 1, 2, . . . , S:
    x_s = One Stage SVRG(x_{s−1}, η, m, b)
  Output: x_S.

One Stage SVRG(x_0, η, m, b):
  Iterate for k = 1, 2, . . . , m:
    Pick I_k ⊂ {1, 2, . . . , n} of size b uniformly at random.
    v_k = (1/b) Σ_{i∈I_k} (∇f_i(x_{k−1}) − ∇f_i(x_0)) + ∇F(x_0).
    x_k = prox_{ηR}(x_{k−1} − η v_k).
  Output: (1/m) Σ_{k=1}^m x_k.
13 / 39
v_k = ∇f_{I_k}(x_{k−1}) − ∇f_{I_k}(x_0) + ∇F(x_0)
Main Idea: Use v_k as an unbiased estimator of ∇F(x_{k−1})
− V[v_k] → 0 as x_{k−1}, x_0 → x*
− Computational cost per inner iteration is the same as SGD's
[Figure: v_k combines the current mini-batch gradient ∇f_{I_k}(x_{k−1}) with the snapshot correction ∇F(x_0) − ∇f_{I_k}(x_0), approximating ∇F(x_{k−1}).]
14 / 39
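The slide's pseudocode translates almost line for line into Python. A minimal sketch (ours; grad_f and prox are assumed callables in the style of the earlier examples):

```python
import numpy as np

def one_stage_svrg(x0, eta, m, b, n, grad_f, prox, rng):
    """One Stage SVRG: grad_f(i, x) = grad f_i(x); prox(x, c) = prox_{cR}(x)."""
    full_grad = np.mean([grad_f(i, x0) for i in range(n)], axis=0)  # grad F(x0)
    x = x0.copy()
    x_sum = np.zeros_like(x0)
    for _ in range(m):
        I = rng.choice(n, size=b, replace=False)                 # mini-batch
        v = np.mean([grad_f(i, x) - grad_f(i, x0) for i in I], axis=0) \
            + full_grad                                          # VR gradient
        x = prox(x - eta * v, eta)
        x_sum += x
    return x_sum / m                                             # stage average

def svrg(x0, eta, m, b, n, grad_f, prox, S, seed=0):
    rng = np.random.default_rng(seed)
    x = x0
    for _ in range(S):
        x = one_stage_svrg(x, eta, m, b, n, grad_f, prox, rng)
    return x
```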
Comparisons of Iteration Complexities:

        | ℓ = 0                     | ℓ ≥ 0
  SGD   | O(L/ε + 1/(bµε))          | O(L/ε + 1/(bµε))
  SVRG  | O((n/b + L/µ) log(1/ε))   | O((n/b + L/µ + Lℓ/(bµ²)) log(1/ε))

n: training set size, L: upper smoothness of f_i, ℓ: lower smoothness of f_i,
b: mini-batch size, ε: optimization error
− Linear convergence
− Limit in mini-batch settings: SVRG requires at least O((L/µ) log(1/ε)) iterations for any mini-batch size b
Questions:
  Is the mini-batch efficiency of SVRG improvable?
  Can SVRG be accelerated by Nesterov's method?
15 / 39
AccProxSVRG [Nitanda (2014)]
Accelerated Proximal SVRG = SVRG + Inner Acceleration

AccProxSVRG(x_0, η, β, m, b, S):
  Iterate for s = 1, 2, . . . , S:
    x_s = One Stage AccProxSVRG(x_{s−1}, η, β, m, b).
  Output: x_S.

One Stage AccProxSVRG(x_0, η, β, m, b):
  Iterate for k = 1, 2, . . . , m:
    Pick I_k ⊂ {1, 2, . . . , n} of size b uniformly at random.
    y_k = x_{k−1} + β(x_{k−1} − x_{k−2}).
    v_k = (1/b) Σ_{i∈I_k} (∇f_i(y_k) − ∇f_i(x_0)) + ∇F(x_0).
    x_k = prox_{ηR}(y_k − η v_k).
  Output: x_m.
17 / 39
y_k = x_{k−1} + β(x_{k−1} − x_{k−2})
Main Idea: Use Nesterov's momentum in each inner iteration
[Figure: the momentum step extrapolates from the previous point x_{k−2} through the current point x_{k−1} to y_k, from which the next point x_k is computed.]
18 / 39
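Relative to the SVRG sketch above, only the inner loop changes: gradients are evaluated at the extrapolated point y_k. A minimal sketch (ours; same illustrative conventions):

```python
import numpy as np

def one_stage_accprox_svrg(x0, eta, beta, m, b, n, grad_f, prox, rng):
    """One Stage AccProxSVRG: SVRG's inner loop plus Nesterov momentum."""
    full_grad = np.mean([grad_f(i, x0) for i in range(n)], axis=0)
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(m):
        y = x + beta * (x - x_prev)                     # momentum step
        I = rng.choice(n, size=b, replace=False)
        v = np.mean([grad_f(i, y) - grad_f(i, x0) for i in I], axis=0) \
            + full_grad                                 # VR gradient at y
        x_prev, x = x, prox(y - eta * v, eta)
    return x
```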
Comparisons of Iteration Complexities:

               | ℓ = 0                                | ℓ ≥ 0
  SVRG         | O((n/b + L/µ) log(1/ε))              | O((n/b + L/µ + Lℓ/(bµ²)) log(1/ε))
  AccProxSVRG  | O((n/b + L/(bµ) + √(L/µ)) log(1/ε))  | No analysis

n: training set size, L: upper smoothness of f_i, ℓ: lower smoothness of f_i,
b: mini-batch size, ε: optimization error
− Linear speed-up w.r.t. mini-batch size b: L/µ (SVRG) → L/(bµ) (AccProxSVRG)
− No acceleration in non-mini-batch settings: for b = 1, the rate of AccProxSVRG is the same as that of SVRG
Question:
  Is the identical rate of AccProxSVRG and SVRG in non-mini-batch settings improvable?
19 / 39
Universal Catalyst [Lin et al. (2015)]
Universal Catalyst: a generic acceleration framework.
Given a non-accelerated algorithm M (for example, SVRG):

UC(x̌_0, κ, {β_t}, {ε_t}, T):
  Iterate for t = 1, 2, . . . , T:
    y̌_t = x̌_{t−1} + β_t(x̌_{t−1} − x̌_{t−2}).
    Define G_t(x) = P(x) + (κ/2)∥x − y̌_t∥₂².
    x̌_t ≈ argmin_{x∈ℝ^d} G_t(x), solved by M up to G_t(x̌_t) − G_t* ≤ ε_t.
  Output: x̌_T.

Main Idea: Run an inexact accelerated PPA and solve each subproblem by M
− UC can be regarded as an application of the Inexact Accelerated Proximal Point Algorithm (PPA: Proximal Point Algorithm).
21 / 39
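A compact sketch of the outer loop (ours; solve_subproblem is an assumed stand-in for running M on G_t up to accuracy ε_t):

```python
import numpy as np

def universal_catalyst(x0, kappa, betas, epss, solve_subproblem):
    """UC outer loop: accelerated proximal-point steps with inexact solves.

    solve_subproblem(y, kappa, eps, x_init) must return an approximate
    minimizer of G(x) = P(x) + (kappa/2)||x - y||^2 with gap <= eps.
    """
    x_prev, x = x0.copy(), x0.copy()
    for beta, eps in zip(betas, epss):
        y = x + beta * (x - x_prev)                 # outer momentum
        x_prev, x = x, solve_subproblem(y, kappa, eps, x_init=x)
    return x
```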
Comparisons of Iteration Complexities:

           | ℓ = 0                            | ℓ ≥ 0
  SVRG     | O((n/b + L/µ) log(1/ε))          | O((n/b + L/µ + Lℓ/(bµ²)) log(1/ε))
  UC+SVRG  | O((n/b + √(nL/(bµ))) log(1/ε))   | O((n/b + √(nL/(bµ)) + (n^{3/4}/b)·√((Lℓ)^{1/2}/µ)) log(1/ε))

n: training set size, L: upper smoothness of f_i, ℓ: lower smoothness of f_i,
b: mini-batch size, ε: optimization error; O hides extra log-factors
− Accelerated rate: L/µ (SVRG) → √(nL/(bµ)) (UC+SVRG)
− Sublinear speed-up w.r.t. mini-batch size b: not sufficient
− Katyusha also achieves the same rate
Practicality:
  Hard to tune the stopping criteria of the subproblems
  Many tuning parameters
Question:
  Is the dependency on the mini-batch size b improvable?
22 / 39
Core Ideas
Double acceleration:
Combining Inner Acceleration and Outer Acceleration
Two approaches:
Applying UC to AccProxSVRG
Directly applying Nesterov’s acceleration to the outer iterations
of AccProxSVRG
The latter algorithm is more direct and practical.
24 / 39
Proposed Algorithm
Doubly Accelerated Stochastic Variance Reduced Dual Averaging
= (SVRDA + Inner Acceleration) + Outer Acceleration

DASVRDA^{sc}(x̌_0, η, m, b, S, T):
  Iterate for t = 1, 2, . . . , T:
    x̌_t = DASVRDA^{ns}(x̌_{t−1}, η, m, b, S).
  Output: x̌_T.

DASVRDA^{ns}(x_0, η, m, b, S):
  Iterate for s = 1, 2, . . . , S:
    y_s = x_{s−1} + ((s−1)/(s+2))(x_{s−1} − x_{s−2}) + ((s+1)/(s+2))(z_{s−1} − x_{s−1}).
    (x_s, z_s) = One Stage AccSVRDA(y_s, x_{s−1}, η, m, b).
  Output: x_S.
25 / 39
One Stage AccSVRDA(x_0, x, η, m, b):
  Iterate for k = 1, 2, . . . , m:
    Pick I_k ⊂ {1, 2, . . . , n} of size b uniformly at random.
    y_k = x_{k−1} + ((k−1)/(k+1))(x_{k−1} − x_{k−2}).
    v_k = (1/b) Σ_{i∈I_k} (∇f_i(y_k) − ∇f_i(x)) + ∇F(x).
    v̄_k = (1 − 2/(k+1)) v̄_{k−1} + (2/(k+1)) v_k.
    z_k = prox_{(ηk(k+1)/4)R}(x_0 − (ηk(k+1)/4) v̄_k).
    x_k = (1 − 2/(k+1)) x_{k−1} + (2/(k+1)) z_k.
  Output: (x_m, z_m).

Main Idea: Combining Inner Acceleration and Outer Acceleration
− For outer acceleration, the new momentum term ((s+1)/(s+2))(z_{s−1} − x_{s−1}) is added
− AccSVRDA = AccSDA [Xiao (2009)] + Variance Reduction
Why SVRDA rather than SVRG?
− Only because lazy updates can be constructed for AccSVRDA.
26 / 39
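A Python sketch of the stage and the outer loop (ours; grad_f and prox are illustrative callables as in the earlier sketches):

```python
import numpy as np

def one_stage_acc_svrda(y0, x_ref, eta, m, b, n, grad_f, prox, rng):
    """One Stage AccSVRDA: inner momentum + dual averaging, VR snapshot x_ref."""
    full_grad = np.mean([grad_f(i, x_ref) for i in range(n)], axis=0)
    x_prev = x = z = y0.copy()
    v_bar = np.zeros_like(y0)
    for k in range(1, m + 1):
        y = x + (k - 1) / (k + 1) * (x - x_prev)            # inner momentum
        I = rng.choice(n, size=b, replace=False)
        v = np.mean([grad_f(i, y) - grad_f(i, x_ref) for i in I], axis=0) \
            + full_grad                                      # VR gradient at y
        v_bar = (1 - 2 / (k + 1)) * v_bar + 2 / (k + 1) * v  # dual average
        c = eta * k * (k + 1) / 4
        z = prox(y0 - c * v_bar, c)                          # dual-averaging step
        x_prev, x = x, (1 - 2 / (k + 1)) * x + 2 / (k + 1) * z
    return x, z

def dasvrda_ns(x0, eta, m, b, n, grad_f, prox, S, seed=0):
    """DASVRDA^{ns}: outer loop with the additional z-momentum term."""
    rng = np.random.default_rng(seed)
    x_prev = x = z = x0.copy()
    for s in range(1, S + 1):
        y = x + (s - 1) / (s + 2) * (x - x_prev) \
              + (s + 1) / (s + 2) * (z - x)                  # double momentum
        x_prev = x
        x, z = one_stage_acc_svrda(y, x_prev, eta, m, b, n, grad_f, prox, rng)
    return x
```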
y_s = x_{s−1} + ((s−1)/(s+2))(x_{s−1} − x_{s−2}) + ((s+1)/(s+2))(z_{s−1} − x_{s−1})
[Figure: x_{s−1} = x_m is a weighted average of the inner iterates z_1, z_2, . . . , z_{m−1}, z_m, and z_{s−1} = z_m; the usual momentum extrapolates along x_{s−1} − x_{s−2}, and the new momentum adds the direction z_{s−1} − x_{s−1}, yielding y_s and then x_s.]
27 / 39
Convergence Analysis (ℓ = 0)
Theorem (ℓ = 0)
Assume that F is (L, −µ)-smooth and each f_i is (L, 0)-smooth. If we appropriately choose
$$\eta = O\Big(\frac{1}{(1 + n/b^2)L}\Big), \quad S = O\Big(1 + \frac{b}{n}\sqrt{\frac{L}{\mu}} + \sqrt{\frac{L}{n\mu}}\Big), \quad T = O(1),$$
then DASVRDA^{sc} achieves an iteration complexity of
$$O\Big(\Big(\frac{n}{b} + \frac{1}{b}\sqrt{\frac{nL}{\mu}} + \sqrt{\frac{L}{\mu}}\Big)\log\frac{1}{\varepsilon}\Big)$$
for E[P(x̌_T) − P(x*)] ≤ ε.
− In contrast, AccProxSVRG: O((n/b + L/(bµ) + √(L/µ)) log(1/ε)),
  UC + SVRG: O((n/b + √(nL/(bµ))) log(1/ε)).
28 / 39
Extension to ℓ ≥ 0
To generalize our results to the case ℓ ≥ 0, we adopt the UC + AccProxSVRG approach.
− For a theoretical guarantee, non-trivial modifications to the algorithm of AccProxSVRG are needed.
UC + AccProxSVRG achieves
$$O\Big(\Big(\frac{n}{b} + \frac{1}{b}\sqrt{\frac{nL}{\mu}} + \frac{n^{3/4}}{b}\sqrt{\frac{(L\ell)^{1/2}}{\mu}}\Big)\log\frac{1}{\varepsilon}\Big)$$
− In contrast, UC + SVRG only achieves
$$O\Big(\Big(\frac{n}{b} + \sqrt{\frac{nL}{b\mu}} + \frac{n^{3/4}}{b}\sqrt{\frac{(L\ell)^{1/2}}{\mu}}\Big)\log\frac{1}{\varepsilon}\Big).$$
29 / 39
Experimental Settings
Model: Elastic Net logistic regression
− Regularization parameters: (λ1, λ2) = (10⁻⁴, 10⁻⁶) and (0, 10⁻⁶)
− µ = 10⁻⁶, ℓ = 0
Data sets and mini-batch sizes:

  Data set | n      | d      | b
  a9a      | 32,561 | 123    | 180
  rcv1     | 20,242 | 47,236 | 140
  sido0    | 12,678 | 4,932  | 100

Implemented algorithms: SVRG, UC+SVRG, AccProxSVRG, UC+AccProxSVRG, APCG (dual), Katyusha, DASVRDA, and DASVRDA with heuristic adaptive restart
31 / 39
Numerical Results
Comparisons on a9a data set:
Figure: (λ1, λ2) = (10⁻⁴, 10⁻⁶). Figure: (λ1, λ2) = (0, 10⁻⁶).
32 / 39
Comparisons on rcv1 data set:
Figure: (λ1, λ2) = (10⁻⁴, 10⁻⁶). Figure: (λ1, λ2) = (0, 10⁻⁶).
33 / 39
Comparisons on sido0 data set:
Figure: (λ1, λ2) = (10⁻⁴, 10⁻⁶). Figure: (λ1, λ2) = (0, 10⁻⁶).
34 / 39
Summary
Conclusion:
New methods for solving convex composite optimization in
mini-batch settings
− Improvement of the mini-batch efficiency of previous methods
− Extension to sum-of-nonconvex objectives
− Numerically outperforms state-of-the-art methods
36 / 39
Reference I
Allen-Zhu, Z. (2017). Katyusha: The First Direct Acceleration of
Stochastic Gradient Methods. In 49th Annual ACM Symposium on the
Theory of Computing.
Allen-Zhu, Z. and Li, Y. (2017). Neon2: Finding local minima via
first-order oracles. arXiv preprint arXiv:1711.06673.
Carmon, Y., Duchi, J. C., Hinder, O., and Sidford, A. (2016).
Accelerated methods for non-convex optimization. arXiv preprint
arXiv:1611.00756.
Garber, D., Hazan, E., Jin, C., Kakade, S. M., Musco, C., Netrapalli, P., and
Sidford, A. (2016). Faster eigenvector computation via
shift-and-invert preconditioning. In Proceedings of The 33rd
International Conference on Machine Learning, volume 48 of
Proceedings of Machine Learning Research, pages 2626–2634.
37 / 39
Reference II
Johnson, R. and Zhang, T. (2013). Accelerating stochastic gradient
descent using predictive variance reduction. In Advances in Neural
Information Processing Systems 26, pages 315–323.
Lin, H., Mairal, J., and Harchaoui, Z. (2015). A universal catalyst for
first-order optimization. In Advances in Neural Information
Processing Systems 28, pages 3384–3392.
Nitanda, A. (2014). Stochastic proximal gradient descent with
acceleration techniques. In Advances in Neural Information
Processing Systems 27, pages 1574–1582.
Xiao, L. (2009). Dual averaging method for regularized stochastic
learning and online optimization. In Advances in Neural
Information Processing Systems 22, pages 2116–2124.
38 / 39
Reference III
Xiao, L. and Zhang, T. (2014). A proximal stochastic gradient
method with progressive variance reduction. SIAM Journal on
Optimization, 24(4), 2057–2075.
Yu, Y., Zou, D., and Gu, Q. (2017). Saving gradient and negative
curvature computations: Finding local minima more efficiently.
arXiv preprint arXiv:1712.03950.
39 / 39
