2. Goal of this study
Machine learning problems with a structured regularization:

  min_{w∈ℝ^p} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(B^⊤ w)

We propose stochastic optimization methods to solve the above problem:

  Stochastic optimization + Alternating Direction Method of Multipliers (ADMM)

Advantages:
  Few samples are needed for each update.
  (We don't need to read all the samples at each iteration.)
  Wide range of applications.
  Easy implementation.
  Nice convergence properties.
  Parallelizable.
3. Proposed method and existing methods
Stochastic methods for regularized learning problems (normal method → ADMM version):

Online:
  Online-ADMM (Wang and Banerjee, 2012)
  Proximal gradient type (Nesterov, 2007): FOBOS (Duchi and Singer, 2009) → OPG-ADMM (Suzuki, 2013; Ouyang et al., 2013)
  Dual averaging type (Nesterov, 2009): RDA (Xiao, 2009) → RDA-ADMM (Suzuki, 2013)
Batch:
  SDCA (Stochastic Dual Coordinate Ascent; Shalev-Shwartz and Zhang, 2013b) → SDCA-ADMM (Suzuki, 2014)
  SAG (Stochastic Average Gradient; Le Roux et al., 2013)
  SVRG (Stochastic Variance Reduced Gradient; Johnson and Zhang, 2013)

Online type: O(1/√T) in general, O(log(T)/T) for a strongly convex objective.
Batch type: linear convergence.
4. Outline
1 Problem settings
2 Stochastic ADMM for online data
3 Stochastic ADMM for batch data
6. High dimensional data analysis
Redundant information deteriorates the estimation accuracy.
Examples: bioinformatics, text data, image data.
7. Sparse estimation
Cut off redundant information → sparsity
R. Tibshirani (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, Vol. 58, No. 1, pages 267–288. Citations: 10,185 (as of 25 May 2014).
8. Regularized learning problem
Lasso:

  min_{w∈ℝ^p} (1/n) Σ_{i=1}^n (y_i − z_i^⊤ w)² + C∥w∥₁,

where the ℓ₁ term is the regularization.

Sparsity: useful for high-dimensional learning problems.
Sparsity-inducing regularization is usually non-smooth.

General regularized learning problem:

  min_{w∈ℝ^p} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ̃(w).
10. Proximal mapping
Regularized learning problem:

  min_{w∈ℝ^p} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ̃(w).

Let g_t ∈ ∂_w ( (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) )|_{w=w_t} and ḡ_t = (1/t) Σ_{τ=1}^t g_τ.

Proximal gradient descent:  w_{t+1} = arg min_{w∈ℝ^p} { g_t^⊤ w + ψ̃(w) + (1/(2η_t)) ∥w − w_t∥² }.

Regularized dual averaging:  w_{t+1} = arg min_{w∈ℝ^p} { ḡ_t^⊤ w + ψ̃(w) + (1/(2η_t)) ∥w∥² }.

A key computation is the proximal mapping:

  prox(q|ψ̃) := arg min_x { ψ̃(x) + (1/2) ∥x − q∥² }.
11. Example of proximal mapping: ℓ₁ regularization

  prox(q|C∥·∥₁) = arg min_x { C∥x∥₁ + (1/2) ∥x − q∥² } = ( sign(q_j) max(|q_j| − C, 0) )_j .

→ Soft-thresholding function. Analytic form!
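The soft-thresholding operator is a one-liner in numpy; a minimal sketch (the function name prox_l1 and the test values are ours, not from the slides):

```python
import numpy as np

def prox_l1(q, C):
    """prox(q | C*||.||_1): elementwise soft-thresholding, the closed-form
    minimizer of C*||x||_1 + 0.5*||x - q||^2."""
    return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

q = np.array([3.0, -0.5, 1.2, -2.0])
print(prox_l1(q, C=1.0))  # [ 2. -0.  0.2 -1. ]
```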
12. Motivation
Structured regularization with large samples:

  min_{w∈ℝ^p} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ̃(w),

where the loss term involves many samples and ψ̃ is complicated.

Difficulties (complex regularization and many samples):

  The proximal mapping cannot be easily computed:

    prox(q|ψ̃) := arg min_x { ψ̃(x) + (1/2) ∥x − q∥² }.

  → ADMM (Alternating Direction Method of Multipliers).

  Computation of the loss term is heavy for a large number of samples.
  → Stochastic optimization.
14. Examples
Overlapped group lasso:

  ψ̃(w) = C Σ_{g∈G} ∥w_g∥

The groups have overlaps, so it is difficult to compute the proximal mapping directly.

Solution: prepare a ψ for which the proximal mapping is easily computable.
Let ψ(B^⊤ w) = ψ̃(w), and utilize the proximal mapping w.r.t. ψ.
16. Decomposition technique
B^⊤ w copies the overlapping coordinates of w so that the resulting groups are independent:

  ψ(y) = C Σ_{g′∈G′} ∥y_{g′}∥,    prox(q|ψ) = ( q_{g′} max{ 1 − C/∥q_{g′}∥, 0 } )_{g′∈G′} .

Applications:
  Graph regularization (Jacob et al., 2009),
  Low rank tensor estimation (Signoretto et al., 2010; Tomioka et al., 2011),
  Dictionary learning (Kasiviswanathan et al., 2012; Rakotomamonjy, 2013).
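A sketch of this decomposition in code: duplicate plays the role of B^⊤, and the blocks of the duplicated vector are disjoint, so the group prox is blockwise soft-thresholding (all names here are ours):

```python
import numpy as np

def duplicate(w, groups):
    """Plays the role of B^T: stack the (possibly overlapping) groups of w."""
    return np.concatenate([w[g] for g in groups])

def prox_group(q, boundaries, C):
    """Blockwise prox of psi(y) = C * sum_{g'} ||y_{g'}|| (group soft-thresholding);
    the blocks of y are disjoint by construction."""
    blocks = []
    for block in np.split(q, boundaries):
        norm = np.linalg.norm(block)
        blocks.append(block * max(1.0 - C / norm, 0.0) if norm > 0 else block)
    return np.concatenate(blocks)

w = np.array([1.0, 2.0, 3.0, 4.0])
groups = [np.array([0, 1, 2]), np.array([2, 3])]  # overlapping groups of w
y = duplicate(w, groups)                          # coordinate w[2] is copied
print(prox_group(y, boundaries=[3], C=1.0))
```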
17. Another example
Graph regularization (Jacob et al., 2009):

  ψ̃(w) = C Σ_{(i,j)∈E} |w_i − w_j|.

  ψ(y) = C Σ_{e∈E} |y_e|,    B^⊤ w = (w_i − w_j)_{(i,j)∈E}

  ⇒  ψ(B^⊤ w) = ψ̃(w),    prox(q|ψ) = ( q_e max{ 1 − C/|q_e|, 0 } )_{e∈E} .

→ Soft-thresholding function.
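A small sketch of this example: B^⊤ w is just the vector of edge differences, and the prox is elementwise soft-thresholding (function names ours):

```python
import numpy as np

def edge_differences(w, edges):
    """B^T w for graph regularization: the vector (w_i - w_j) over the edges."""
    return np.array([w[i] - w[j] for (i, j) in edges])

def prox_abs(q, C):
    """Elementwise prox of psi(y) = C * sum_e |y_e|: soft-thresholding."""
    return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

w = np.array([1.0, 1.1, 3.0])
E = [(0, 1), (1, 2)]
print(prox_abs(edge_differences(w, E), C=0.5))  # [-0.  -1.4]
```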
18. Optimizing the composite objective function

  min_w (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(B^⊤ w)
  ⇔  min_{w,y} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(y)   s.t.  y = B^⊤ w.

Augmented Lagrangian:

  L(w, y, λ) = (1/n) Σ_i f_i(z_i^⊤ w) + ψ(y) + λ^⊤(y − B^⊤ w) + (ρ/2) ∥y − B^⊤ w∥².

sup_λ inf_{w,y} L(w, y, λ) yields the optimum of the original problem.
The augmented Lagrangian is the basis of the method of multipliers (Hestenes, 1969; Powell, 1969; Rockafellar, 1976).
19. ADMM

  min_w (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(B^⊤ w)
  ⇔  min_{w,y} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(y)   s.t.  y = B^⊤ w,

  L(w, y, λ) = (1/n) Σ_i f_i(z_i^⊤ w) + ψ(y) + λ^⊤(y − B^⊤ w) + (ρ/2) ∥y − B^⊤ w∥².

ADMM (Gabay and Mercier, 1976)

  w_t = arg min_w { (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) − λ_{t−1}^⊤(B^⊤ w) + (ρ/2) ∥y_{t−1} − B^⊤ w∥² }
  y_t = arg min_y { ψ(y) + λ_{t−1}^⊤ y + (ρ/2) ∥y − B^⊤ w_t∥² }  ( = prox(B^⊤ w_t − λ_{t−1}/ρ | ψ/ρ), by completing the square )
  λ_t = λ_{t−1} − ρ(B^⊤ w_t − y_t)

The computational difficulty of the proximal mapping is resolved.
However, the computation of (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) is still heavy.
→ Stochastic versions of ADMM:
  Online type: Suzuki (2013); Wang and Banerjee (2012); Ouyang et al. (2013).
  Batch type: Suzuki (2014) → linear convergence.
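For concreteness, a hedged sketch of the batch ADMM loop above, assuming the squared loss of the Lasso example (up to a 1/2 scaling) and ψ = C∥·∥₁, so the y-step is soft-thresholding and the w-step reduces to a linear solve; all names are ours:

```python
import numpy as np

def admm_structured_lasso(Z, b, B, C, rho=1.0, iters=100):
    """Batch ADMM sketch for min_w (1/(2n))*||Z^T w - b||^2 + C*||B^T w||_1.

    Z: (p, n) data matrix (columns z_i), b: (n,) targets, B: (p, d)."""
    p, n = Z.shape
    d = B.shape[1]
    w, y, lam = np.zeros(p), np.zeros(d), np.zeros(d)
    # Quadratic loss => the w-step is a linear solve; a tiny ridge keeps M invertible.
    M = Z @ Z.T / n + rho * B @ B.T + 1e-8 * np.eye(p)
    for _ in range(iters):
        w = np.linalg.solve(M, Z @ b / n + B @ (lam + rho * y))
        q = B.T @ w - lam / rho
        y = np.sign(q) * np.maximum(np.abs(q) - C / rho, 0.0)  # prox(q | psi/rho)
        lam = lam - rho * (B.T @ w - y)
    return w
```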
21. Outline
1 Problem settings
2 Stochastic ADMM for online data
3 Stochastic ADMM for batch data
22. Online setting
[Figure: samples z_{t−1}, z_t, … drawn from P(z) arrive sequentially; the iterate w_t is updated after each observation (observe → update → observe → update).]

  min_w (1/T) Σ_{t=1}^T f(z_t; w) + ψ̃(w)

We don't want to read all the samples.
w_t is updated per one (or a few) observations.
23. Existing online optimization methods
Let g_t ∈ ∂f_t(w_t) and ḡ_t = (1/t) Σ_{τ=1}^t g_τ, where f_t(w) = f(z_t; w).

OPG (Online Proximal Gradient descent):

  w_{t+1} = arg min_w { ⟨g_t, w⟩ + ψ̃(w) + (1/(2η_t)) ∥w − w_t∥² }

RDA (Regularized Dual Averaging; Xiao (2009); Nesterov (2009)):

  w_{t+1} = arg min_w { ⟨ḡ_t, w⟩ + ψ̃(w) + (1/(2η_t)) ∥w∥² }

These update rules are computed via the proximal mapping associated with ψ̃.
Efficient for a simple regularization such as ℓ₁.
How about structured regularization? → ADMM.
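For a simple regularizer such as ψ̃ = C∥·∥₁ (the case mentioned above), both updates reduce to soft-thresholding in closed form; a minimal sketch (function names ours):

```python
import numpy as np

def soft_threshold(q, tau):
    return np.sign(q) * np.maximum(np.abs(q) - tau, 0.0)

def opg_step(w, g, eta, C):
    """OPG with psi_tilde = C*||.||_1: a proximal gradient step from w."""
    return soft_threshold(w - eta * g, eta * C)

def rda_step(g_bar, eta_t, C):
    """RDA with psi_tilde = C*||.||_1: prox applied to the averaged gradient."""
    return soft_threshold(-eta_t * g_bar, eta_t * C)
```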
24. Proposed method 1: RDA-ADMM
Ordinary RDA:  w_{t+1} = arg min_w { ⟨ḡ_t, w⟩ + ψ̃(w) + (1/(2η_t)) ∥w∥² }.

RDA-ADMM
Let x̄_t = (1/t) Σ_{τ=1}^t x_τ, λ̄_t = (1/t) Σ_{τ=1}^t λ_τ, ȳ_t = (1/t) Σ_{τ=1}^t y_τ, ḡ_t = (1/t) Σ_{τ=1}^t g_τ:

  x_{t+1} = arg min_{x∈X} { ḡ_t^⊤ x − λ̄_t^⊤ B^⊤ x + (ρ/(2t)) ∥B^⊤ x∥² + ρ(B^⊤ x̄_t − ȳ_t)^⊤ B^⊤ x + (1/(2η_t)) ∥x∥²_{G_t} },
  y_{t+1} = arg min_{y∈Y} { ψ(y) − λ_t^⊤(B^⊤ x_{t+1} − y) + (ρ/2) ∥B^⊤ x_{t+1} − y∥² },
  λ_{t+1} = λ_t − ρ(B^⊤ x_{t+1} − y_{t+1}).

The update rules of y_{t+1} and λ_{t+1} are the same as in ordinary ADMM:

  y_{t+1} = prox(B^⊤ x_{t+1} − λ_t/ρ | ψ/ρ).
25. Proposed method 2: OPG-ADMM
Ordinary OPG:  w_{t+1} = arg min_w { ⟨g_t, w⟩ + ψ̃(w) + (1/(2η_t)) ∥w − w_t∥² }.

OPG-ADMM

  x_{t+1} = arg min_{x∈X} { g_t^⊤ x − λ_t^⊤(B^⊤ x − y_t) + (ρ/2) ∥B^⊤ x − y_t∥² + (1/(2η_t)) ∥x − x_t∥²_{G_t} },
  y_{t+1} = prox(B^⊤ x_{t+1} − λ_t/ρ | ψ/ρ),
  λ_{t+1} = λ_t − ρ(B^⊤ x_{t+1} − y_{t+1}).

The update rules of y_{t+1} and λ_{t+1} are the same as in ordinary ADMM.
29. Simplified update rules
Choosing G_t = γI − (ρη_t/t) BB^⊤ for RDA-ADMM and G_t = γI − ρη_t BB^⊤ for OPG-ADMM, the x-update reduces to a projected gradient-like step.

Online stochastic ADMM
1. Update of x:

  (RDA-ADMM)  x_{t+1} = Π_X[ −(η_t/γ) { ḡ_t − B(λ̄_t − ρ(B^⊤ x̄_t − ȳ_t)) } ],
  (OPG-ADMM)  x_{t+1} = Π_X[ −(η_t/γ) { g_t − B(λ_t − ρ(B^⊤ x_t − y_t)) } + x_t ].

2. Update of y:

  y_{t+1} = prox(B^⊤ x_{t+1} − λ_t/ρ | ψ/ρ).

3. Update of λ:

  λ_{t+1} = λ_t − ρ(B^⊤ x_{t+1} − y_{t+1}).

Fast computation.
Easy implementation.
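A sketch of one OPG-ADMM iteration following the simplified rules above, assuming ψ = C∥·∥₁ and X = ℝ^p (so Π_X is the identity and the y-step is soft-thresholding); since the symbols here are partially reconstructed from the slides, treat this as illustrative:

```python
import numpy as np

def soft_threshold(q, tau):
    return np.sign(q) * np.maximum(np.abs(q) - tau, 0.0)

def opg_admm_step(x, y, lam, g, B, C, rho, eta, gamma):
    """One OPG-ADMM iteration (sketch). g is a stochastic subgradient at x;
    B is the (p, d) structure matrix; lam is the multiplier."""
    x_new = x - (eta / gamma) * (g - B @ (lam - rho * (B.T @ x - y)))
    y_new = soft_threshold(B.T @ x_new - lam / rho, C / rho)  # prox(.|psi/rho)
    lam_new = lam - rho * (B.T @ x_new - y_new)
    return x_new, y_new, lam_new
```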
30. Convergence analysis
We bound the expected risk:

  F(x) = E_Z[f(Z; x)] + ψ̃(x).

Assumptions:
  (A1) ∃G s.t. ∥g∥ ≤ G for all g ∈ ∂_x f(z; x) and all z, x.
  (A2) ∃L s.t. ∥g∥ ≤ L for all g ∈ ∂ψ(y) and all y.
  (A3) ∃R s.t. ∥x∥ ≤ R for all x ∈ X.

Theorem (Convergence rate of RDA-ADMM)
Under (A1), (A2), (A3), we have

  E_{z_{1:T−1}}[F(x̄_T) − F(x*)] ≤ (1/T) Σ_{t=2}^T (η_{t−1}/(2(t−1))) G² + (γ/(2η_T)) ∥x*∥² + K/T.

Theorem (Convergence rate of OPG-ADMM)
Under (A1), (A2), (A3), we have

  E_{z_{1:T−1}}[F(x̄_T) − F(x*)] ≤ (1/(2T)) Σ_{t=2}^T max{ 1/η_t − 1/η_{t−1}, 0 } R² + (1/T) Σ_{t=1}^T (η_t/2) G² + K/T.

Both methods attain the convergence rate O(1/√T) by letting η_t = η_0√t for RDA-ADMM and η_t = η_0/√t for OPG-ADMM.
38. Convergence rate: strongly convex case
(A4) There exist σ_f, σ_ψ ≥ 0 and P, Q ⪰ O such that

  E_Z[f(Z; x) + (x′ − x)^⊤ ∇_x f(Z; x)] + (σ_f/2) ∥x − x′∥²_P ≤ E_Z[f(Z; x′)],
  ψ(y) + (y′ − y)^⊤ ∇ψ(y) + (σ_ψ/2) ∥y − y′∥²_Q ≤ ψ(y′),

for all x, x′ ∈ X and y, y′ ∈ Y, and there exists μ ≥ 0 satisfying

  μI ⪯ σ_f P + σ_ψ B Q B^⊤.

The update rule of RDA-ADMM is modified as

  x_{t+1} = arg min_{x∈X} { ḡ_t^⊤ x − λ̄_t^⊤ B^⊤ x + (ρ/(2t)) ∥B^⊤ x∥² + ρ(B^⊤ x̄_t − ȳ_t)^⊤ B^⊤ x + (1/(2η_t)) ∥x∥²_{G_t} + (μ/2) ∥x − x̄_t∥² }.
40. Convergence analysis: strongly convex case
Theorem (Convergence rate of RDA-ADMM)
Under (A1), (A2), (A3), (A4), we have

  E_{z_{1:T−1}}[F(x̄_T) − F(x*)] ≤ (1/(2T)) Σ_{t=2}^T G² / ( γ(t−1)/η_{t−1} + μt ) + (γ/(2η_T)) ∥x*∥² + K/T.

Theorem (Convergence rate of OPG-ADMM)
Under (A1), (A2), (A3), (A4), we have

  E_{z_{1:T−1}}[F(x̄_T) − F(x*)] ≤ (1/(2T)) Σ_{t=2}^T max{ 1/η_t − 1/η_{t−1} − μ, 0 } R² + (1/T) Σ_{t=1}^T (η_t/2) G² + K/T.

Both methods attain the convergence rate O(log(T)/T) by letting η_t = η_0 t for RDA-ADMM and η_t = η_0/t for OPG-ADMM.
41. Numerical experiments
[Figure: Expected risk vs. CPU time (log–log). Artificial data: 1024 dimensions, 512 samples, overlapped group lasso. Methods: OPG-ADMM, RDA-ADMM, NL-ADMM, RDA, OPG.]

[Figure: Classification error vs. CPU time (log–log). Real data (Adult, a9a): 15,252 dimensions, 32,561 samples, overlapped group lasso + ℓ₁ regularization. Methods: OPG-ADMM, RDA-ADMM, NL-ADMM, RDA.]
43. Outline
1 Problem settings
2 Stochastic ADMM for online data
3 Stochastic ADMM for batch data
46. Batch setting
The objective is defined by

  (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(B^⊤ w).

We construct a method that
  observes a few samples per iteration (like an online method),
  converges linearly (unlike online methods).
47. Dual problem
The dual formulation is utilized to derive our algorithm.

Primal:  min_w (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(B^⊤ w)

Dual problem:

  min_{x∈ℝ^n, y∈ℝ^d} (1/n) Σ_{i=1}^n f_i*(x_i) + ψ*(y/n)   s.t.  Zx + By = 0,

where Z = [z_1, z_2, …, z_n] ∈ ℝ^{p×n}.

Optimality conditions:

  z_i^⊤ w* ∈ ∇f_i*(x_i*),    y*/n ∈ ∇ψ(u)|_{u=B^⊤ w*},    Zx* + By* = 0.

⋆ Each coordinate x_i corresponds to each sample z_i.
48. Dual of a loss function
[Figure: the logistic loss ℓ(y, x) = log(1 + exp(−yx)) and its convex conjugate ℓ*(y, u) = (−yu) log(−yu) + (1 + yu) log(1 + yu).]
49. Dual of regularization functions
[Figure: ℓ₁ and elastic-net regularization functions and their duals.]
50. Proposed algorithm
Basic algorithm
Split the index set {1, …, n} into K groups (I_1, I_2, …, I_K).
For each t = 1, 2, …
  Choose k ∈ {1, …, K} uniformly at random, and set I = I_k.

  y^(t) ← arg min_y { nψ*(y/n) − ⟨w^(t−1), Zx^(t−1) + By⟩ + (ρ/2) ∥Zx^(t−1) + By∥² + (1/2) ∥y − y^(t−1)∥²_Q },

  x_I^(t) ← arg min_{x_I} { Σ_{i∈I} f_i*(x_i) − ⟨w^(t−1), Z_I x_I + By^(t)⟩ + (ρ/2) ∥Z_I x_I + Z_{\I} x_{\I}^(t−1) + By^(t)∥² + (1/2) ∥x_I − x_I^(t−1)∥²_{G_{I,I}} },

  w^(t) ← w^(t−1) − ρ{ n(Zx^(t) + By^(t)) − (n − n/K)(Zx^(t−1) + By^(t−1)) },

where Q, G are some appropriate positive semidefinite matrices.

We observe only a few samples at each iteration.
Mini-batch technique (Takac et al., 2013; Shalev-Shwartz and Zhang, 2013a).
If K = n, we need to observe only one sample at each iteration.
54. Simplified algorithm
Setting

  Q = ρ(γ_B I_d − B^⊤ B),    G_{I,I} = ρ(γ_{Z,I} I_{|I|} − Z_I^⊤ Z_I),

then, by the relation prox(q|ψ) + prox(q|ψ*) = q, we obtain the following update rule:

Simplified algorithm
For q^(t) = y^(t−1) + (1/(ργ_B)) B^⊤ { w^(t−1) − ρ(Zx^(t−1) + By^(t−1)) }, let

  y^(t) ← q^(t) − prox(q^(t) | nψ/(ργ_B));

For p_I^(t) = x_I^(t−1) + (1/(ργ_{Z,I})) Z_I^⊤ { w^(t−1) − ρ(Zx^(t−1) + By^(t)) }, let

  x_i^(t) ← prox(p_i^(t) | f_i*/(ργ_{Z,I}))   (∀i ∈ I).

⋆ The update of x can be parallelized.
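The Moreau decomposition prox(q|ψ) + prox(q|ψ*) = q used above is easy to check numerically; a small sketch with ψ = C∥·∥₁, whose prox is soft-thresholding and whose conjugate is the indicator of the ℓ∞-ball of radius C (so the conjugate's prox is clipping):

```python
import numpy as np

q = np.array([3.0, -0.5, 1.2, -2.0])
C = 1.0
prox_psi = np.sign(q) * np.maximum(np.abs(q) - C, 0.0)  # prox(q | C*||.||_1)
prox_conj = np.clip(q, -C, C)  # prox of the conjugate: projection onto ||.||_inf <= C
print(np.allclose(prox_psi + prox_conj, q))  # True
```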
56. Convergence analysis
x*: the optimal x.
Y*: the set of optimal y (not necessarily unique).
w*: the optimal w.

Assumptions:
  There exists v > 0 such that, ∀x_i ∈ ℝ,

    f_i*(x_i) − f_i*(x_i*) ≥ ⟨∇f_i*(x_i*), x_i − x_i*⟩ + (v/2) ∥x_i − x_i*∥².

  There exist h, v_ψ ≥ 0 such that, for all y, u, there exists ŷ ∈ Y* such that

    ψ*(y/n) ≥ ψ*(ŷ/n) + ⟨B^⊤ w*, y/n − ŷ/n⟩ + (v_ψ/2) ∥P_{Ker(B)^⊥}(y/n − ŷ/n)∥²,
    ψ(u) ≥ ψ(B^⊤ w*) + ⟨ŷ/n, u − B^⊤ w*⟩ + (h/2) ∥u − B^⊤ w*∥².

  B^⊤ is injective.
57. Linear convergence
Define the dual objective and the residual

  F_D(x, y) := (1/n) Σ_{i=1}^n f_i*(x_i) + ψ*(y/n) − ⟨w*, (Zx + By)/n⟩,

  R_D(x, y, w) := F_D(x, y) − F_D(x*, y*) + (1/(2n²ρ)) ∥w − w*∥² + ((1 − γ̃)ρ/(2n)) ∥Zx + By∥²
                  + (1/(2n)) ∥x − x*∥²_{vI+H} + (1/(2nK)) ∥y − y*∥²_Q.

Theorem (Linear convergence of SDCA-ADMM)
Let H be a matrix such that H_{I,I} = Z_I^⊤ Z_I + G_{I,I} for all I ∈ {I_1, …, I_K},

  μ = min{ v/(4(v + λ_max(H))),  hρ λ_min(B^⊤B)/(2 max{1/n, 4hρ, 4hρ λ_max(Q)}),
           K v_ψ/(4n λ_max(Q)),  K v_ψ λ_min(BB^⊤)/(4 λ_max(Q)(λ_max(Z^⊤Z) + 4v)) },

γ̃ = 1/(4n), and C₁ = R_D(x^(0), y^(0), w^(0)); then we have

  (dual residual)     E[R_D(x^(t), y^(t), w^(t))] ≤ (1 − μ/K)^t C₁,
  (primal variable)   E[∥w^(t) − w*∥²] ≤ 2n²ρ (1 − μ/K)^t C₁.

For t ≥ C′ (K/μ) log(C″ n/ϵ), we have E[∥w^(t) − w*∥²] ≤ ϵ.
69. Numerical Experiments (Real data): Setting
Real data: graph regularization:

  ψ̃(w) = C₁ Σ_{i=1}^p |w_i| + C₂ Σ_{(i,j)∈E} |w_i − w_j| + 0.01 ( C₁ Σ_{i=1}^p |w_i|² + C₂ Σ_{(i,j)∈E} |w_i − w_j|² ),

where E is the set of edges obtained from the similarity matrix.
The similarity graph is obtained by Graphical Lasso (Yuan and Lin, 2007; Banerjee et al., 2008).
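A small sketch evaluating this regularizer for a given weight vector and edge set (function name and test values ours):

```python
import numpy as np

def graph_enet_regularizer(w, E, C1, C2):
    """psi_tilde(w) for the experiments: l1 and graph-difference terms plus
    0.01 times their squared (elastic-net-style) counterparts."""
    l1 = C1 * np.sum(np.abs(w)) + C2 * sum(abs(w[i] - w[j]) for i, j in E)
    sq = C1 * np.sum(w ** 2) + C2 * sum((w[i] - w[j]) ** 2 for i, j in E)
    return l1 + 0.01 * sq

w = np.array([0.5, -0.2, 0.0])
E = [(0, 1), (1, 2)]
print(graph_enet_regularizer(w, E, C1=1.0, C2=1.0))
```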
70. Numerical Experiments (Real data): Results
[Figure: Empirical risk, test loss, and classification error vs. CPU time. Real data (20 Newsgroups, n = 12,995), graph regularization, mini-batch size n/K = 50. Methods: OPG-ADMM, RDA-ADMM, OL-ADMM, SDCA-ADMM. Panels: (a) objective function, (b) test loss, (c) classification error.]
71. Numerical Experiments (Real data): Results
[Figure: Empirical risk, test loss, and classification error vs. CPU time. Real data (a9a, n = 32,561), graph regularization, mini-batch size n/K = 50. Methods: OPG-ADMM, RDA-ADMM, OL-ADMM, SDCA-ADMM. Panels: (a) objective function, (b) test loss, (c) classification error.]
72. Summary
New stochastic optimization methods with ADMM are proposed:
  Applicable to a wide range of regularization functions.
  Easy implementation.

Online setting:
  RDA-ADMM, OPG-ADMM.
  O(1/√T) in general, and O(log(T)/T) for a strongly convex objective.

Batch setting:
  SDCA-ADMM (stochastic dual coordinate ascent ADMM).
  Linear convergence with strong convexity around the optimum.
  The convergence rate is controlled by the mini-batch size via K.
73. References I
O. Banerjee, L. E. Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.
J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2873–2908, 2009.
D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers & Mathematics with Applications, 2:17–40, 1976.
M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4:303–320, 1969.
75. References II
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 315–323. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems 25, 2013.
Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2007.
77. References III
Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, Series B, 120:221–259, 2009.
H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, 2013.
M. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, pages 283–298. Academic Press, London, New York, 1969.
R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1:97–116, 1976.
S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems 26, 2013a.
78. References IV
S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. Technical report, 2013b. arXiv:1211.2717.
T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning, pages 392–400, 2013.
T. Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 736–744, 2014.
M. Takac, A. Bijral, P. Richtarik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning, 2013.
H. Wang and A. Banerjee. Online alternating direction method. In Proceedings of the 29th International Conference on Machine Learning, 2012.
79. References V
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23, 2009.
M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.