2. Goal of this study
Machine learning problems with a structured regularization:

  min_{w∈ℝ^p} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(B^⊤ w)

We propose stochastic optimization methods to solve the above problem:

  Stochastic optimization + Alternating Direction Method of Multipliers (ADMM)

Advantages:
  Few samples are needed for each update.
  (We don't need to read all the samples at each iteration.)
  Wide range of applications.
  Easy implementation.
  Nice convergence properties.
  Parallelizable.
3. Proposed method and existing methods
Stochastic methods for regularized learning problems (normal method → ADMM version):

Online:
  Online-ADMM (Wang and Banerjee, 2012)
  Proximal gradient type (Nesterov, 2007): FOBOS (Duchi and Singer, 2009) → OPG-ADMM (Suzuki, 2013; Ouyang et al., 2013)
  Dual averaging type (Nesterov, 2009): RDA (Xiao, 2009) → RDA-ADMM (Suzuki, 2013)
Batch:
  SDCA (Stochastic Dual Coordinate Ascent; Shalev-Shwartz and Zhang, 2013b) → SDCA-ADMM (Suzuki, 2014)
  SAG (Stochastic Average Gradient; Le Roux et al., 2013)
  SVRG (Stochastic Variance Reduced Gradient; Johnson and Zhang, 2013)

Online type: O(1/√T) in general, O(log(T)/T) for a strongly convex objective.
Batch type: linear convergence.
4. Outline
1 Problem settings
2 Stochastic ADMM for online data
3 Stochastic ADMM for batch data
6. High dimensional data analysis
Redundant information deteriorates the estimation accuracy.
Examples: bioinformatics, text data, image data.
7. Sparse estimation
Cut off redundant information → sparsity
R. Tibshirani (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, Vol. 58, No. 1, pages 267–288. Citations: 10,185 (as of 25 May 2014).
8. Regularized learning problem
Lasso:

  min_{w∈ℝ^p} (1/n) Σ_{i=1}^n (y_i − z_i^⊤ w)² + C∥w∥₁,

where the ℓ₁ term is the regularization.

Sparsity: useful for high-dimensional learning problems.
Sparsity-inducing regularization is usually non-smooth.

General regularized learning problem:

  min_{w∈ℝ^p} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ̃(w).
10. Proximal mapping
Regularized learning problem:

  min_{w∈ℝ^p} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ̃(w).

Let g_t ∈ ∂_w ( (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) )|_{w=w_t} and ḡ_t = (1/t) Σ_{τ=1}^t g_τ.

Proximal gradient descent:  w_{t+1} = arg min_{w∈ℝ^p} { g_t^⊤ w + ψ̃(w) + (1/(2η_t)) ∥w − w_t∥² }.

Regularized dual averaging:  w_{t+1} = arg min_{w∈ℝ^p} { ḡ_t^⊤ w + ψ̃(w) + (1/(2η_t)) ∥w∥² }.

A key computation is the proximal mapping:

  prox(q|ψ̃) := arg min_x { ψ̃(x) + (1/2) ∥x − q∥² }.
11. Example of proximal mapping: ℓ₁ regularization

  prox(q|C∥·∥₁) = arg min_x { C∥x∥₁ + (1/2) ∥x − q∥² } = ( sign(q_j) max(|q_j| − C, 0) )_j .

→ Soft-thresholding function. Analytic form!
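The soft-thresholding operator is a one-liner in numpy; a minimal sketch (the function name prox_l1 and the test values are ours, not from the slides):

```python
import numpy as np

def prox_l1(q, C):
    """prox(q | C*||.||_1): elementwise soft-thresholding, the closed-form
    minimizer of C*||x||_1 + 0.5*||x - q||^2."""
    return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

q = np.array([3.0, -0.5, 1.2, -2.0])
print(prox_l1(q, C=1.0))  # [ 2. -0.  0.2 -1. ]
```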
12. Motivation
Structured regularization with large samples:

  min_{w∈ℝ^p} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ̃(w),

where the loss term involves many samples and ψ̃ is complicated.

Difficulties (complex regularization and many samples):

  The proximal mapping cannot be easily computed:

    prox(q|ψ̃) := arg min_x { ψ̃(x) + (1/2) ∥x − q∥² }.

  → ADMM (Alternating Direction Method of Multipliers).

  Computation of the loss term is heavy for a large number of samples.
  → Stochastic optimization.
14. Examples
Overlapped group lasso:

  ψ̃(w) = C Σ_{g∈G} ∥w_g∥

The groups have overlaps, so it is difficult to compute the proximal mapping directly.

Solution: prepare a ψ for which the proximal mapping is easily computable.
Let ψ(B^⊤ w) = ψ̃(w), and utilize the proximal mapping w.r.t. ψ.
16. Decomposition technique
B^⊤ w copies the overlapping coordinates of w so that the resulting groups are independent:

  ψ(y) = C Σ_{g′∈G′} ∥y_{g′}∥,    prox(q|ψ) = ( q_{g′} max{ 1 − C/∥q_{g′}∥, 0 } )_{g′∈G′} .

Applications:
  Graph regularization (Jacob et al., 2009),
  Low rank tensor estimation (Signoretto et al., 2010; Tomioka et al., 2011),
  Dictionary learning (Kasiviswanathan et al., 2012; Rakotomamonjy, 2013).
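A sketch of this decomposition in code: duplicate plays the role of B^⊤, and the blocks of the duplicated vector are disjoint, so the group prox is blockwise soft-thresholding (all names here are ours):

```python
import numpy as np

def duplicate(w, groups):
    """Plays the role of B^T: stack the (possibly overlapping) groups of w."""
    return np.concatenate([w[g] for g in groups])

def prox_group(q, boundaries, C):
    """Blockwise prox of psi(y) = C * sum_{g'} ||y_{g'}|| (group soft-thresholding);
    the blocks of y are disjoint by construction."""
    blocks = []
    for block in np.split(q, boundaries):
        norm = np.linalg.norm(block)
        blocks.append(block * max(1.0 - C / norm, 0.0) if norm > 0 else block)
    return np.concatenate(blocks)

w = np.array([1.0, 2.0, 3.0, 4.0])
groups = [np.array([0, 1, 2]), np.array([2, 3])]  # overlapping groups of w
y = duplicate(w, groups)                          # coordinate w[2] is copied
print(prox_group(y, boundaries=[3], C=1.0))
```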
17. Another example
Graph regularization (Jacob et al., 2009):

  ψ̃(w) = C Σ_{(i,j)∈E} |w_i − w_j|.

  ψ(y) = C Σ_{e∈E} |y_e|,    B^⊤ w = (w_i − w_j)_{(i,j)∈E}

  ⇒  ψ(B^⊤ w) = ψ̃(w),    prox(q|ψ) = ( q_e max{ 1 − C/|q_e|, 0 } )_{e∈E} .

→ Soft-thresholding function.
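A small sketch of this example: B^⊤ w is just the vector of edge differences, and the prox is elementwise soft-thresholding (function names ours):

```python
import numpy as np

def edge_differences(w, edges):
    """B^T w for graph regularization: the vector (w_i - w_j) over the edges."""
    return np.array([w[i] - w[j] for (i, j) in edges])

def prox_abs(q, C):
    """Elementwise prox of psi(y) = C * sum_e |y_e|: soft-thresholding."""
    return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

w = np.array([1.0, 1.1, 3.0])
E = [(0, 1), (1, 2)]
print(prox_abs(edge_differences(w, E), C=0.5))  # [-0.  -1.4]
```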
18. Optimizing the composite objective function

  min_w (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(B^⊤ w)
  ⇔  min_{w,y} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(y)   s.t.  y = B^⊤ w.

Augmented Lagrangian:

  L(w, y, λ) = (1/n) Σ_i f_i(z_i^⊤ w) + ψ(y) + λ^⊤(y − B^⊤ w) + (ρ/2) ∥y − B^⊤ w∥².

sup_λ inf_{w,y} L(w, y, λ) yields the optimum of the original problem.
The augmented Lagrangian is the basis of the method of multipliers (Hestenes, 1969; Powell, 1969; Rockafellar, 1976).
19. ADMM

  min_w (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(B^⊤ w)
  ⇔  min_{w,y} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(y)   s.t.  y = B^⊤ w,

  L(w, y, λ) = (1/n) Σ_i f_i(z_i^⊤ w) + ψ(y) + λ^⊤(y − B^⊤ w) + (ρ/2) ∥y − B^⊤ w∥².

ADMM (Gabay and Mercier, 1976)

  w_t = arg min_w { (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) − λ_{t−1}^⊤(B^⊤ w) + (ρ/2) ∥y_{t−1} − B^⊤ w∥² }
  y_t = arg min_y { ψ(y) + λ_{t−1}^⊤ y + (ρ/2) ∥y − B^⊤ w_t∥² }  ( = prox(B^⊤ w_t − λ_{t−1}/ρ | ψ/ρ), by completing the square )
  λ_t = λ_{t−1} − ρ(B^⊤ w_t − y_t)

The computational difficulty of the proximal mapping is resolved.
However, the computation of (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) is still heavy.
→ Stochastic versions of ADMM:
  Online type: Suzuki (2013); Wang and Banerjee (2012); Ouyang et al. (2013).
  Batch type: Suzuki (2014) → linear convergence.
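For concreteness, a hedged sketch of the batch ADMM loop above, assuming the squared loss of the Lasso example (up to a 1/2 scaling) and ψ = C∥·∥₁, so the y-step is soft-thresholding and the w-step reduces to a linear solve; all names are ours:

```python
import numpy as np

def admm_structured_lasso(Z, b, B, C, rho=1.0, iters=100):
    """Batch ADMM sketch for min_w (1/(2n))*||Z^T w - b||^2 + C*||B^T w||_1.

    Z: (p, n) data matrix (columns z_i), b: (n,) targets, B: (p, d)."""
    p, n = Z.shape
    d = B.shape[1]
    w, y, lam = np.zeros(p), np.zeros(d), np.zeros(d)
    # Quadratic loss => the w-step is a linear solve; a tiny ridge keeps M invertible.
    M = Z @ Z.T / n + rho * B @ B.T + 1e-8 * np.eye(p)
    for _ in range(iters):
        w = np.linalg.solve(M, Z @ b / n + B @ (lam + rho * y))
        q = B.T @ w - lam / rho
        y = np.sign(q) * np.maximum(np.abs(q) - C / rho, 0.0)  # prox(q | psi/rho)
        lam = lam - rho * (B.T @ w - y)
    return w
```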
21. Outline
1 Problem settings
2 Stochastic ADMM for online data
3 Stochastic ADMM for batch data
22. Online setting
[Figure: samples z_{t−1}, z_t, … drawn from P(z) arrive sequentially; the iterate w_t is updated after each observation (observe → update → observe → update).]

  min_w (1/T) Σ_{t=1}^T f(z_t; w) + ψ̃(w)

We don't want to read all the samples.
w_t is updated per one (or a few) observations.
23. Existing online optimization methods
Let g_t ∈ ∂f_t(w_t) and ḡ_t = (1/t) Σ_{τ=1}^t g_τ, where f_t(w) = f(z_t; w).

OPG (Online Proximal Gradient descent):

  w_{t+1} = arg min_w { ⟨g_t, w⟩ + ψ̃(w) + (1/(2η_t)) ∥w − w_t∥² }

RDA (Regularized Dual Averaging; Xiao (2009); Nesterov (2009)):

  w_{t+1} = arg min_w { ⟨ḡ_t, w⟩ + ψ̃(w) + (1/(2η_t)) ∥w∥² }

These update rules are computed via the proximal mapping associated with ψ̃.
Efficient for a simple regularization such as ℓ₁.
How about structured regularization? → ADMM.
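For a simple regularizer such as ψ̃ = C∥·∥₁ (the case mentioned above), both updates reduce to soft-thresholding in closed form; a minimal sketch (function names ours):

```python
import numpy as np

def soft_threshold(q, tau):
    return np.sign(q) * np.maximum(np.abs(q) - tau, 0.0)

def opg_step(w, g, eta, C):
    """OPG with psi_tilde = C*||.||_1: a proximal gradient step from w."""
    return soft_threshold(w - eta * g, eta * C)

def rda_step(g_bar, eta_t, C):
    """RDA with psi_tilde = C*||.||_1: prox applied to the averaged gradient."""
    return soft_threshold(-eta_t * g_bar, eta_t * C)
```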
24. Proposed method 1: RDA-ADMM
Ordinary RDA:  w_{t+1} = arg min_w { ⟨ḡ_t, w⟩ + ψ̃(w) + (1/(2η_t)) ∥w∥² }.

RDA-ADMM
Let x̄_t = (1/t) Σ_{τ=1}^t x_τ, λ̄_t = (1/t) Σ_{τ=1}^t λ_τ, ȳ_t = (1/t) Σ_{τ=1}^t y_τ, ḡ_t = (1/t) Σ_{τ=1}^t g_τ:

  x_{t+1} = arg min_{x∈X} { ḡ_t^⊤ x − λ̄_t^⊤ B^⊤ x + (ρ/(2t)) ∥B^⊤ x∥² + ρ(B^⊤ x̄_t − ȳ_t)^⊤ B^⊤ x + (1/(2η_t)) ∥x∥²_{G_t} },
  y_{t+1} = arg min_{y∈Y} { ψ(y) − λ_t^⊤(B^⊤ x_{t+1} − y) + (ρ/2) ∥B^⊤ x_{t+1} − y∥² },
  λ_{t+1} = λ_t − ρ(B^⊤ x_{t+1} − y_{t+1}).

The update rules of y_{t+1} and λ_{t+1} are the same as in ordinary ADMM:

  y_{t+1} = prox(B^⊤ x_{t+1} − λ_t/ρ | ψ/ρ).
25. Proposed method 2: OPG-ADMM
Ordinary OPG:  w_{t+1} = arg min_w { ⟨g_t, w⟩ + ψ̃(w) + (1/(2η_t)) ∥w − w_t∥² }.

OPG-ADMM

  x_{t+1} = arg min_{x∈X} { g_t^⊤ x − λ_t^⊤(B^⊤ x − y_t) + (ρ/2) ∥B^⊤ x − y_t∥² + (1/(2η_t)) ∥x − x_t∥²_{G_t} },
  y_{t+1} = prox(B^⊤ x_{t+1} − λ_t/ρ | ψ/ρ),
  λ_{t+1} = λ_t − ρ(B^⊤ x_{t+1} − y_{t+1}).

The update rules of y_{t+1} and λ_{t+1} are the same as in ordinary ADMM.
29. Simplified update rules
Choosing G_t = γI − (ρη_t/t) BB^⊤ for RDA-ADMM and G_t = γI − ρη_t BB^⊤ for OPG-ADMM, the x-update reduces to a projected gradient-like step.

Online stochastic ADMM
1. Update of x:

  (RDA-ADMM)  x_{t+1} = Π_X[ −(η_t/γ) { ḡ_t − B(λ̄_t − ρ(B^⊤ x̄_t − ȳ_t)) } ],
  (OPG-ADMM)  x_{t+1} = Π_X[ −(η_t/γ) { g_t − B(λ_t − ρ(B^⊤ x_t − y_t)) } + x_t ].

2. Update of y:

  y_{t+1} = prox(B^⊤ x_{t+1} − λ_t/ρ | ψ/ρ).

3. Update of λ:

  λ_{t+1} = λ_t − ρ(B^⊤ x_{t+1} − y_{t+1}).

Fast computation.
Easy implementation.
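A sketch of one OPG-ADMM iteration following the simplified rules above, assuming ψ = C∥·∥₁ and X = ℝ^p (so Π_X is the identity and the y-step is soft-thresholding); since the symbols here are partially reconstructed from the slides, treat this as illustrative:

```python
import numpy as np

def soft_threshold(q, tau):
    return np.sign(q) * np.maximum(np.abs(q) - tau, 0.0)

def opg_admm_step(x, y, lam, g, B, C, rho, eta, gamma):
    """One OPG-ADMM iteration (sketch). g is a stochastic subgradient at x;
    B is the (p, d) structure matrix; lam is the multiplier."""
    x_new = x - (eta / gamma) * (g - B @ (lam - rho * (B.T @ x - y)))
    y_new = soft_threshold(B.T @ x_new - lam / rho, C / rho)  # prox(.|psi/rho)
    lam_new = lam - rho * (B.T @ x_new - y_new)
    return x_new, y_new, lam_new
```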
30. Convergence analysis
We bound the expected risk:

  F(x) = E_Z[f(Z; x)] + ψ̃(x).

Assumptions:
  (A1) ∃G s.t. ∥g∥ ≤ G for all g ∈ ∂_x f(z; x) and all z, x.
  (A2) ∃L s.t. ∥g∥ ≤ L for all g ∈ ∂ψ(y) and all y.
  (A3) ∃R s.t. ∥x∥ ≤ R for all x ∈ X.

Theorem (Convergence rate of RDA-ADMM)
Under (A1), (A2), (A3), we have

  E_{z_{1:T−1}}[F(x̄_T) − F(x*)] ≤ (1/T) Σ_{t=2}^T (η_{t−1}/(2(t−1))) G² + (γ/(2η_T)) ∥x*∥² + K/T.

Theorem (Convergence rate of OPG-ADMM)
Under (A1), (A2), (A3), we have

  E_{z_{1:T−1}}[F(x̄_T) − F(x*)] ≤ (1/(2T)) Σ_{t=2}^T max{ 1/η_t − 1/η_{t−1}, 0 } R² + (1/T) Σ_{t=1}^T (η_t/2) G² + K/T.

Both methods attain the convergence rate O(1/√T) by letting η_t = η_0√t for RDA-ADMM and η_t = η_0/√t for OPG-ADMM.
38. Convergence rate: strongly convex case
(A4) There exist σ_f, σ_ψ ≥ 0 and P, Q ⪰ O such that

  E_Z[f(Z; x) + (x′ − x)^⊤ ∇_x f(Z; x)] + (σ_f/2) ∥x − x′∥²_P ≤ E_Z[f(Z; x′)],
  ψ(y) + (y′ − y)^⊤ ∇ψ(y) + (σ_ψ/2) ∥y − y′∥²_Q ≤ ψ(y′),

for all x, x′ ∈ X and y, y′ ∈ Y, and there exists μ ≥ 0 satisfying

  μI ⪯ σ_f P + σ_ψ B Q B^⊤.

The update rule of RDA-ADMM is modified as

  x_{t+1} = arg min_{x∈X} { ḡ_t^⊤ x − λ̄_t^⊤ B^⊤ x + (ρ/(2t)) ∥B^⊤ x∥² + ρ(B^⊤ x̄_t − ȳ_t)^⊤ B^⊤ x + (1/(2η_t)) ∥x∥²_{G_t} + (μ/2) ∥x − x̄_t∥² }.
40. Convergence analysis: strongly convex case
Theorem (Convergence rate of RDA-ADMM)
Under (A1), (A2), (A3), (A4), we have

  E_{z_{1:T−1}}[F(x̄_T) − F(x*)] ≤ (1/(2T)) Σ_{t=2}^T G² / ( γ(t−1)/η_{t−1} + μt ) + (γ/(2η_T)) ∥x*∥² + K/T.

Theorem (Convergence rate of OPG-ADMM)
Under (A1), (A2), (A3), (A4), we have

  E_{z_{1:T−1}}[F(x̄_T) − F(x*)] ≤ (1/(2T)) Σ_{t=2}^T max{ 1/η_t − 1/η_{t−1} − μ, 0 } R² + (1/T) Σ_{t=1}^T (η_t/2) G² + K/T.

Both methods attain the convergence rate O(log(T)/T) by letting η_t = η_0 t for RDA-ADMM and η_t = η_0/t for OPG-ADMM.
41. Numerical experiments
[Figure: Expected risk vs. CPU time (log–log). Artificial data: 1024 dimensions, 512 samples, overlapped group lasso. Methods: OPG-ADMM, RDA-ADMM, NL-ADMM, RDA, OPG.]

[Figure: Classification error vs. CPU time (log–log). Real data (Adult, a9a): 15,252 dimensions, 32,561 samples, overlapped group lasso + ℓ₁ regularization. Methods: OPG-ADMM, RDA-ADMM, NL-ADMM, RDA.]
43. Outline
1 Problem settings
2 Stochastic ADMM for online data
3 Stochastic ADMM for batch data
46. Batch setting
The objective is defined by

  (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(B^⊤ w).

We construct a method that
  observes a few samples per iteration (like an online method),
  converges linearly (unlike online methods).
47. Dual problem
The dual formulation is utilized to derive our algorithm.

Primal:  min_w (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(B^⊤ w)

Dual problem:

  min_{x∈ℝ^n, y∈ℝ^d} (1/n) Σ_{i=1}^n f_i*(x_i) + ψ*(y/n)   s.t.  Zx + By = 0,

where Z = [z_1, z_2, …, z_n] ∈ ℝ^{p×n}.

Optimality conditions:

  z_i^⊤ w* ∈ ∇f_i*(x_i*),    y*/n ∈ ∇ψ(u)|_{u=B^⊤ w*},    Zx* + By* = 0.

⋆ Each coordinate x_i corresponds to each sample z_i.
48. Dual of a loss function
[Figure: the logistic loss ℓ(y, x) = log(1 + exp(−yx)) and its convex conjugate ℓ*(y, u) = (−yu) log(−yu) + (1 + yu) log(1 + yu).]
49. Dual of regularization functions
[Figure: ℓ₁ and elastic-net regularization functions and their duals.]
50. Proposed algorithm
Basic algorithm
Split the index set {1, …, n} into K groups (I_1, I_2, …, I_K).
For each t = 1, 2, …
  Choose k ∈ {1, …, K} uniformly at random, and set I = I_k.

  y^(t) ← arg min_y { nψ*(y/n) − ⟨w^(t−1), Zx^(t−1) + By⟩ + (ρ/2) ∥Zx^(t−1) + By∥² + (1/2) ∥y − y^(t−1)∥²_Q },

  x_I^(t) ← arg min_{x_I} { Σ_{i∈I} f_i*(x_i) − ⟨w^(t−1), Z_I x_I + By^(t)⟩ + (ρ/2) ∥Z_I x_I + Z_{\I} x_{\I}^(t−1) + By^(t)∥² + (1/2) ∥x_I − x_I^(t−1)∥²_{G_{I,I}} },

  w^(t) ← w^(t−1) − ρ{ n(Zx^(t) + By^(t)) − (n − n/K)(Zx^(t−1) + By^(t−1)) },

where Q, G are some appropriate positive semidefinite matrices.

We observe only a few samples at each iteration.
Mini-batch technique (Takac et al., 2013; Shalev-Shwartz and Zhang, 2013a).
If K = n, we need to observe only one sample at each iteration.
54. Simplified algorithm
Setting

  Q = ρ(γ_B I_d − B^⊤ B),    G_{I,I} = ρ(γ_{Z,I} I_{|I|} − Z_I^⊤ Z_I),

then, by the relation prox(q|ψ) + prox(q|ψ*) = q, we obtain the following update rule:

Simplified algorithm
For q^(t) = y^(t−1) + (1/(ργ_B)) B^⊤ { w^(t−1) − ρ(Zx^(t−1) + By^(t−1)) }, let

  y^(t) ← q^(t) − prox(q^(t) | nψ/(ργ_B));

For p_I^(t) = x_I^(t−1) + (1/(ργ_{Z,I})) Z_I^⊤ { w^(t−1) − ρ(Zx^(t−1) + By^(t)) }, let

  x_i^(t) ← prox(p_i^(t) | f_i*/(ργ_{Z,I}))   (∀i ∈ I).

⋆ The update of x can be parallelized.
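The Moreau decomposition prox(q|ψ) + prox(q|ψ*) = q used above is easy to check numerically; a small sketch with ψ = C∥·∥₁, whose prox is soft-thresholding and whose conjugate is the indicator of the ℓ∞-ball of radius C (so the conjugate's prox is clipping):

```python
import numpy as np

q = np.array([3.0, -0.5, 1.2, -2.0])
C = 1.0
prox_psi = np.sign(q) * np.maximum(np.abs(q) - C, 0.0)  # prox(q | C*||.||_1)
prox_conj = np.clip(q, -C, C)  # prox of the conjugate: projection onto ||.||_inf <= C
print(np.allclose(prox_psi + prox_conj, q))  # True
```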
56. Convergence analysis
x*: the optimal x.
Y*: the set of optimal y (not necessarily unique).
w*: the optimal w.

Assumptions:
  There exists v > 0 such that, ∀x_i ∈ ℝ,

    f_i*(x_i) − f_i*(x_i*) ≥ ⟨∇f_i*(x_i*), x_i − x_i*⟩ + (v/2) ∥x_i − x_i*∥².

  There exist h, v_ψ ≥ 0 such that, for all y, u, there exists ŷ ∈ Y* such that

    ψ*(y/n) ≥ ψ*(ŷ/n) + ⟨B^⊤ w*, y/n − ŷ/n⟩ + (v_ψ/2) ∥P_{Ker(B)^⊥}(y/n − ŷ/n)∥²,
    ψ(u) ≥ ψ(B^⊤ w*) + ⟨ŷ/n, u − B^⊤ w*⟩ + (h/2) ∥u − B^⊤ w*∥².

  B^⊤ is injective.
57. Linear convergence
Define the dual objective and the residual

  F_D(x, y) := (1/n) Σ_{i=1}^n f_i*(x_i) + ψ*(y/n) − ⟨w*, (Zx + By)/n⟩,

  R_D(x, y, w) := F_D(x, y) − F_D(x*, y*) + (1/(2n²ρ)) ∥w − w*∥² + ((1 − γ̃)ρ/(2n)) ∥Zx + By∥²
                  + (1/(2n)) ∥x − x*∥²_{vI+H} + (1/(2nK)) ∥y − y*∥²_Q.

Theorem (Linear convergence of SDCA-ADMM)
Let H be a matrix such that H_{I,I} = Z_I^⊤ Z_I + G_{I,I} for all I ∈ {I_1, …, I_K},

  μ = min{ v/(4(v + λ_max(H))),  hρ λ_min(B^⊤B)/(2 max{1/n, 4hρ, 4hρ λ_max(Q)}),
           K v_ψ/(4n λ_max(Q)),  K v_ψ λ_min(BB^⊤)/(4 λ_max(Q)(λ_max(Z^⊤Z) + 4v)) },

γ̃ = 1/(4n), and C₁ = R_D(x^(0), y^(0), w^(0)); then we have

  (dual residual)     E[R_D(x^(t), y^(t), w^(t))] ≤ (1 − μ/K)^t C₁,
  (primal variable)   E[∥w^(t) − w*∥²] ≤ 2n²ρ (1 − μ/K)^t C₁.

For t ≥ C′ (K/μ) log(C″ n/ϵ), we have E[∥w^(t) − w*∥²] ≤ ϵ.
69. Numerical Experiments (Real data): Setting
Real data: graph regularization:

  ψ̃(w) = C₁ Σ_{i=1}^p |w_i| + C₂ Σ_{(i,j)∈E} |w_i − w_j| + 0.01 ( C₁ Σ_{i=1}^p |w_i|² + C₂ Σ_{(i,j)∈E} |w_i − w_j|² ),

where E is the set of edges obtained from the similarity matrix.
The similarity graph is obtained by Graphical Lasso (Yuan and Lin, 2007; Banerjee et al., 2008).
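A small sketch evaluating this regularizer for a given weight vector and edge set (function name and test values ours):

```python
import numpy as np

def graph_enet_regularizer(w, E, C1, C2):
    """psi_tilde(w) for the experiments: l1 and graph-difference terms plus
    0.01 times their squared (elastic-net-style) counterparts."""
    l1 = C1 * np.sum(np.abs(w)) + C2 * sum(abs(w[i] - w[j]) for i, j in E)
    sq = C1 * np.sum(w ** 2) + C2 * sum((w[i] - w[j]) ** 2 for i, j in E)
    return l1 + 0.01 * sq

w = np.array([0.5, -0.2, 0.0])
E = [(0, 1), (1, 2)]
print(graph_enet_regularizer(w, E, C1=1.0, C2=1.0))
```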
70. Numerical Experiments (Real data): Results
[Figure: Empirical risk, test loss, and classification error vs. CPU time. Real data (20 Newsgroups, n = 12,995), graph regularization, mini-batch size n/K = 50. Methods: OPG-ADMM, RDA-ADMM, OL-ADMM, SDCA-ADMM. Panels: (a) objective function, (b) test loss, (c) classification error.]
71. Numerical Experiments (Real data): Results
[Figure: Empirical risk, test loss, and classification error vs. CPU time. Real data (a9a, n = 32,561), graph regularization, mini-batch size n/K = 50. Methods: OPG-ADMM, RDA-ADMM, OL-ADMM, SDCA-ADMM. Panels: (a) objective function, (b) test loss, (c) classification error.]
72. Summary
New stochastic optimization methods with ADMM are proposed:
  Applicable to a wide range of regularization functions.
  Easy implementation.

Online setting:
  RDA-ADMM, OPG-ADMM.
  O(1/√T) in general, and O(log(T)/T) for a strongly convex objective.

Batch setting:
  SDCA-ADMM (stochastic dual coordinate ascent ADMM).
  Linear convergence with strong convexity around the optimum.
  The convergence rate is controlled by the mini-batch size via K.
73. References I
O. Banerjee, L. E. Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.
J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2873–2908, 2009.
D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers & Mathematics with Applications, 2:17–40, 1976.
M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4:303–320, 1969.
75. References II
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 315–323. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems 25, 2013.
Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2007.
77. References III
Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, Series B, 120:221–259, 2009.
H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, 2013.
M. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, pages 283–298. Academic Press, London, New York, 1969.
R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1:97–116, 1976.
S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems 26, 2013a.
78. References IV
S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. Technical report, 2013b. arXiv:1211.2717.
T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning, pages 392–400, 2013.
T. Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 736–744, 2014.
M. Takac, A. Bijral, P. Richtarik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning, 2013.
H. Wang and A. Banerjee. Online alternating direction method. In Proceedings of the 29th International Conference on Machine Learning, 2012.
79. References V
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23, 2009.
M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.