Stochastic Alternating Direction Method of Multipliers 
Taiji Suzuki 
Tokyo Institute of Technology 
1 / 49
Goal of this study 
Machine learning problems with a structured regularization. 
min_{w ∈ R^p} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(B^⊤ w)
We propose stochastic optimization methods to solve the above problem. 
Stochastic optimization 
+ 
Alternating Direction Method of Multipliers (ADMM) 
Advantageous points 
Few samples are needed for each update. 
(We don't need to read all the samples for each iteration.) 
Wide range of applications 
Easy implementation 
Nice convergence properties 
Parallelizable 
2 / 49
Proposed method and existing methods 
Stochastic methods for regularized learning problems. 
Online
  Proximal gradient type (Nesterov, 2007): FOBOS (Duchi and Singer, 2009); ADMM versions: OPG-ADMM (Suzuki, 2013; Ouyang et al., 2013), Online-ADMM (Wang and Banerjee, 2012)
  Dual averaging type (Nesterov, 2009): RDA (Xiao, 2009); ADMM version: RDA-ADMM (Suzuki, 2013)
Batch
  SDCA (Stochastic Dual Coordinate Ascent; Shalev-Shwartz and Zhang, 2013b), SAG (Stochastic Averaging Gradient; Le Roux et al., 2013), SVRG (Stochastic Variance Reduced Gradient; Johnson and Zhang, 2013); ADMM version: SDCA-ADMM (Suzuki, 2014)
Online type: O(1/√T) in general, O(log(T)/T) for a strongly convex objective.
Batch type: linear convergence.
3 / 49
Outline 
1 Problem settings 
2 Stochastic ADMM for online data 
3 Stochastic ADMM for batch data 
4 / 49
High dimensional data analysis 
Redundant information deteriorates the estimation accuracy. 
Bio-informatics Text data Image data 
5 / 49
Sparse estimation 
Cut off redundant information → sparsity 
R. Tibshirani (1996). Regression shrinkage and selection via the lasso. J. Royal Statist. Soc. B, Vol. 58, No. 1, pages 267-288. Citation count: 10185 (25/May/2014)
6 / 49
Regularized learning problem 
Lasso:
min_{w ∈ R^p} (1/n) Σ_{i=1}^n (y_i − z_i^⊤ w)² + C∥w∥₁   (the ∥w∥₁ term is the regularization)
Sparsity: useful for high dimensional learning problems. 
Sparsity inducing regularization is usually non-smooth. 
General regularized learning problem:
min_{w ∈ R^p} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ̃(w).
7 / 49
Proximal mapping 
Regularized learning problem:
min_{w ∈ R^p} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ̃(w).
g_t ∈ ∂_w [(1/n) Σ_{i=1}^n f_i(z_i^⊤ w)]|_{w=w_t},   ḡ_t = (1/t) Σ_{τ=1}^t g_τ.
Proximal gradient descent: w_{t+1} = argmin_{w ∈ R^p} { g_t^⊤ w + ψ̃(w) + (1/(2η_t)) ∥w − w_t∥² }.
Regularized dual averaging: w_{t+1} = argmin_{w ∈ R^p} { ḡ_t^⊤ w + ψ̃(w) + (1/(2η_t)) ∥w∥² }.
A key computation is the proximal mapping:
prox(q | ψ̃) := argmin_x { ψ̃(x) + (1/2) ∥x − q∥² }.
8 / 49
Example of Proximal mapping: ℓ1 regularization 
prox(q | C∥·∥₁) = argmin_x { C∥x∥₁ + (1/2) ∥x − q∥² } = (sign(q_j) max(|q_j| − C, 0))_j.
→ Soft-thresholding function. Analytic form! 
9 / 49
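As a small illustration (not part of the slides), a minimal NumPy sketch of this soft-thresholding prox; the function and variable names are my own:

```python
import numpy as np

def prox_l1(q, c):
    """prox(q | C*||.||_1): coordinate-wise soft-thresholding (minimal sketch)."""
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

# example with threshold C = 0.5
print(prox_l1(np.array([1.2, -0.3, 0.7]), 0.5))  # -> [0.7, 0.0, 0.2]
```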
Motivation 
Structured regularization with large samples:
min_{w ∈ R^p} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w)   [large samples]   + ψ̃(w)   [complicated]
Difficulties (complex regularization and many samples) 
The proximal mapping cannot be easily computed:
prox(q | ψ̃) := argmin_x { ψ̃(x) + (1/2) ∥x − q∥² }.
→ ADMM (Alternating Direction Method of Multipliers). 
Computation of the loss function is heavy for a large number of samples. 
→ Stochastic optimization. 
10 / 49
Examples 
Overlapped group lasso: ψ̃(w) = C Σ_{g ∈ G} ∥w_g∥
The groups have overlaps. It is difficult to compute the proximal mapping. 
Solution: 
Prepare ψ for which the proximal mapping is easily computable. 
Let ψ(B⊤w) = ψ̃(w), and utilize the proximal mapping w.r.t. ψ. 
11 / 49
Decomposition technique 
Decompose into independent groups: B⊤w copies the overlapping coordinates into disjoint groups g′ ∈ G′, so that
ψ(y) = C Σ_{g′ ∈ G′} ∥y_{g′}∥,
prox(q | ψ) = ( q_{g′} max{ 1 − C/∥q_{g′}∥, 0 } )_{g′ ∈ G′}.
Applications: graph regularization (Jacob et al., 2009), low rank tensor estimation (Signoretto et al., 2010; Tomioka et al., 2011), dictionary learning (Kasiviswanathan et al., 2012; Rakotomamonjy, 2013). 
12 / 49
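A sketch of this group-wise prox for disjoint groups (NumPy; not from the slides, function and variable names are mine):

```python
import numpy as np

def prox_group(q, groups, C):
    """prox(q | C * sum_g ||q_g||) for disjoint groups: group-wise shrinkage (sketch)."""
    out = np.zeros_like(q)
    for g in groups:                      # each g is a list of indices of one group
        norm = np.linalg.norm(q[g])
        if norm > 0:
            out[g] = q[g] * max(1.0 - C / norm, 0.0)
    return out

# example: two disjoint groups of a duplicated vector y = B^T w
y = np.array([3.0, 4.0, 0.1, 0.2])
print(prox_group(y, [[0, 1], [2, 3]], C=1.0))  # first group shrunk, second set to 0
```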
Another example 
Graph regularization (Jacob et al., 2009): ψ̃(w) = C Σ_{(i,j) ∈ E} |w_i − w_j|.
ψ(y) = C Σ_{e ∈ E} |y_e|,   B⊤w = (w_i − w_j)_{(i,j) ∈ E}
⇒ ψ(B⊤w) = ψ̃(w),   prox(q | ψ) = ( q_e max{ 1 − C/|q_e|, 0 } )_{e ∈ E}.
Soft-thresholding function. 
13 / 49
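A sketch of how B⊤w and the edge-wise prox look for this graph regularizer (NumPy; the edge-difference construction and names are my assumptions, not the slides'):

```python
import numpy as np

def edge_difference_matrix(p, edges):
    """Build B with one column per edge e = (i, j) so that (B^T w)_e = w_i - w_j."""
    B = np.zeros((p, len(edges)))
    for e, (i, j) in enumerate(edges):
        B[i, e], B[j, e] = 1.0, -1.0
    return B

def prox_edge_l1(q, C):
    """prox(q | C*||.||_1) applied edge-wise, i.e. soft-thresholding."""
    return np.sign(q) * np.maximum(np.abs(q) - C, 0.0)

edges = [(0, 1), (1, 2)]
B = edge_difference_matrix(3, edges)
w = np.array([1.0, 0.4, 0.5])
print(prox_edge_l1(B.T @ w, C=0.2))   # shrink the edge differences w_i - w_j
```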
Optimizing composite objective function 
min_w (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(B⊤w)   ⇔   min_{w,y} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(y)   s.t.   y = B⊤w.
Augmented Lagrangian 
L(w, y, λ) = (1/n) Σ_i f_i(z_i^⊤ w) + ψ(y) + λ^⊤(y − B⊤w) + (ρ/2) ∥y − B⊤w∥².
sup_λ inf_{w,y} L(w, y, λ) yields the optimum of the original problem. 
The augmented Lagrangian is the basis of the method of multipliers 
(Hestenes, 1969; Powell, 1969; Rockafellar, 1976). 
14 / 49
ADMM 
min_w (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(B⊤w)   ⇔   min_{w,y} (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(y)   s.t.   y = B⊤w.
L(w, y, λ) = (1/n) Σ_i f_i(z_i^⊤ w) + ψ(y) + λ^⊤(y − B⊤w) + (ρ/2) ∥y − B⊤w∥².
ADMM (Gabay and Mercier, 1976) 
w_t = argmin_w { (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) − λ_{t−1}^⊤ B⊤w + (ρ/2) ∥y_{t−1} − B⊤w∥² }
y_t = argmin_y { λ_{t−1}^⊤ y + ψ(y) + (ρ/2) ∥y − B⊤w_t∥² }   (= prox(B⊤w_t − λ_{t−1}/ρ | ψ/ρ))
λ_t = λ_{t−1} − ρ(B⊤w_t − y_t)
The computational difficulty of the proximal mapping is resolved. 
However, the computation of (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) is still heavy. 
→ Stochastic versions of ADMM. 
Online type: Suzuki (2013); Wang and Banerjee (2012); Ouyang et al. (2013) 
Batch type: Suzuki (2014) → linear convergence. 
15 / 49
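As a rough illustration of the iteration above (not the slides' implementation), here is a minimal NumPy sketch for the special case of squared loss, where the w-update reduces to a linear system; the loss choice, prox_psi interface, and all names are assumptions of this sketch:

```python
import numpy as np

def admm_squared_loss(Z, targets, B, prox_psi, rho=1.0, n_iter=100):
    """ADMM sketch for min_w (1/(2n))||targets - Z^T w||^2 + psi(B^T w).

    Z: (p, n) with columns z_i; B: (p, d); prox_psi(q, tau) must return
    argmin_y { psi(y) + (1/(2*tau)) * ||y - q||^2 }.
    """
    p, n = Z.shape
    d = B.shape[1]
    w, y, lam = np.zeros(p), np.zeros(d), np.zeros(d)
    A = Z @ Z.T / n + rho * (B @ B.T)              # normal equations for the w-step
    for _ in range(n_iter):
        rhs = Z @ targets / n + B @ (rho * y + lam)
        w = np.linalg.solve(A, rhs)                # w-update (exact for squared loss)
        y = prox_psi(B.T @ w - lam / rho, 1.0 / rho)   # y-update via the prox of psi
        lam = lam - rho * (B.T @ w - y)            # multiplier (dual) update
    return w
```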
Outline 
1 Problem settings 
2 Stochastic ADMM for online data 
3 Stochastic ADMM for batch data 
16 / 49
Online setting 
[Figure: samples z_{t−1}, z_t, … drawn from P(z); observe a sample, update w_t → w_{t+1}, observe, update, …]
min_w (1/T) Σ_{t=1}^T f(z_t; w) + ψ̃(w)
We don't want to read all the samples. 
w_t is updated per one (or a few) observation(s). 
17 / 49
Existing online optimization methods 
g_t ∈ ∂f_t(w_t),   ḡ_t = (1/t) Σ_{τ=1}^t g_τ   (f_t(w) = f(z_t; w)) 
OPG (Online Proximal Gradient Descent) 
w_{t+1} = argmin_w { ⟨g_t, w⟩ + ψ̃(w) + (1/(2η_t)) ∥w − w_t∥² }
RDA (Regularized Dual Averaging; Xiao (2009); Nesterov (2009)) 
w_{t+1} = argmin_w { ⟨ḡ_t, w⟩ + ψ̃(w) + (1/(2η_t)) ∥w∥² }
These update rules are computed by a proximal mapping associated with ψ̃. 
Efficient for a simple regularization such as ℓ1 regularization. 
How about structured regularization? → ADMM. 
18 / 49
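A sketch of the OPG and RDA updates for the simple case ψ̃ = C∥·∥₁, where the prox is soft-thresholding (NumPy; not from the slides, names are mine):

```python
import numpy as np

def soft_threshold(q, c):
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

def opg_step(w, g, eta, C):
    # argmin_w' <g, w'> + C||w'||_1 + ||w' - w||^2 / (2*eta)
    return soft_threshold(w - eta * g, eta * C)

def rda_step(g_bar, eta, C):
    # argmin_w' <g_bar, w'> + C||w'||_1 + ||w'||^2 / (2*eta), g_bar = averaged gradient
    return soft_threshold(-eta * g_bar, eta * C)
```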
Proposed method 1: RDA-ADMM 
Ordinary RDA: w_{t+1} = argmin_w { ⟨ḡ_t, w⟩ + ψ̃(w) + (1/(2η_t)) ∥w∥² }
RDA-ADMM 
Let x̄_t = (1/t) Σ_{τ=1}^t x_τ,  λ̄_t = (1/t) Σ_{τ=1}^t λ_τ,  ȳ_t = (1/t) Σ_{τ=1}^t y_τ,  ḡ_t = (1/t) Σ_{τ=1}^t g_τ.
x_{t+1} = argmin_{x ∈ X} { ḡ_t^⊤ x − λ̄_t^⊤ B⊤x + (ρ/(2t)) ∥B⊤x∥² + ρ(B⊤x̄_t − ȳ_t)^⊤ B⊤x + (1/(2η_t)) ∥x∥²_{G_t} },
y_{t+1} = argmin_{y ∈ Y} { ψ(y) − λ_t^⊤(B⊤x_{t+1} − y) + (ρ/2) ∥B⊤x_{t+1} − y∥² },
λ_{t+1} = λ_t − ρ(B⊤x_{t+1} − y_{t+1}).
The update rules of y_{t+1} and λ_{t+1} are the same as in the ordinary ADMM: 
y_{t+1} = prox(B⊤x_{t+1} − λ_t/ρ | ψ/ρ)
19 / 49
Proposed method 2: OPG-ADMM 
Ordinary OPG: w_{t+1} = argmin_w { ⟨g_t, w⟩ + ψ̃(w) + (1/(2η_t)) ∥w − w_t∥² }.
OPG-ADMM 
x_{t+1} = argmin_{x ∈ X} { g_t^⊤ x − λ_t^⊤(B⊤x − y_t) + (ρ/2) ∥B⊤x − y_t∥² + (1/(2η_t)) ∥x − x_t∥²_{G_t} },
y_{t+1} = prox(B⊤x_{t+1} − λ_t/ρ | ψ/ρ),
λ_{t+1} = λ_t − ρ(B⊤x_{t+1} − y_{t+1}).
The update rules of y_{t+1} and λ_{t+1} are the same as in the ordinary ADMM. 
20 / 49
Simplified version 
Letting G_t be a specific form, the update rule is much simplified. 
(G_t = γI − (ρη_t/t) BB⊤ for RDA-ADMM, and G_t = γI − ρη_t BB⊤ for OPG-ADMM) 
Online stochastic ADMM 
1 Update of x:
(RDA-ADMM)  x_{t+1} = Π_X[ −(η_t/γ) {ḡ_t − B(λ̄_t − ρB⊤x̄_t + ρȳ_t)} ],
(OPG-ADMM)  x_{t+1} = Π_X[ x_t − (η_t/γ) {g_t − B(λ_t − ρB⊤x_t + ρy_t)} ].
2 Update of y:
y_{t+1} = prox(B⊤x_{t+1} − λ_t/ρ | ψ/ρ).
3 Update of λ:
λ_{t+1} = λ_t − ρ(B⊤x_{t+1} − y_{t+1}).
Fast computation. 
Easy implementation. 
21 / 49
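A minimal sketch of the simplified OPG-ADMM loop for the case ψ = C∥·∥₁ and X = R^p (so the projection Π_X drops out); the stochastic-gradient callback, the choice of γ, and all names are assumptions of this sketch, not the slides' implementation:

```python
import numpy as np

def soft_threshold(q, c):
    return np.sign(q) * np.maximum(np.abs(q) - c, 0.0)

def opg_admm(sample_grad, B, p, T, rho=1.0, eta0=1.0, C=0.1):
    """Simplified OPG-ADMM sketch; sample_grad(t, x) returns a stochastic (sub)gradient."""
    d = B.shape[1]
    # gamma chosen so that G_t = gamma*I - rho*eta_t*B*B^T stays positive semidefinite
    gamma = rho * eta0 * np.linalg.norm(B @ B.T, 2) + 1.0
    x, y, lam = np.zeros(p), np.zeros(d), np.zeros(d)
    for t in range(1, T + 1):
        eta = eta0 / np.sqrt(t)                          # eta_t = eta_0 / sqrt(t)
        g = sample_grad(t, x)
        x = x - (eta / gamma) * (g - B @ (lam - rho * (B.T @ x) + rho * y))
        y = soft_threshold(B.T @ x - lam / rho, C / rho)  # prox w.r.t. psi/rho
        lam = lam - rho * (B.T @ x - y)
    return x
```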
Convergence analysis 
We bound the expected risk: 
Expected risk: F(x) = E_Z[f(Z, x)] + ψ̃(x).
Assumptions: 
(A1) ∃G s.t. ∀g ∈ ∂_x f(z, x): ∥g∥ ≤ G for all z, x.
(A2) ∃L s.t. ∀g ∈ ∂ψ(y): ∥g∥ ≤ L for all y.
(A3) ∃R s.t. ∀x ∈ X: ∥x∥ ≤ R.
22 / 49
Convergence rate: bounded gradient 
(A1) ∃G s.t. ∀g ∈ ∂_x f(z, x): ∥g∥ ≤ G for all z, x.
(A2) ∃L s.t. ∀g ∈ ∂ψ(y): ∥g∥ ≤ L for all y.
(A3) ∃R s.t. ∀x ∈ X: ∥x∥ ≤ R.
Theorem (Convergence rate of RDA-ADMM) 
Under (A1), (A2), (A3), we have 
E_{z_{1:T−1}}[F(x̄_T) − F(x*)] ≤ (1/T) Σ_{t=2}^T (η_{t−1}/(2(t−1))) G² + (ρ/T) ∥x*∥² + K/T.
Theorem (Convergence rate of OPG-ADMM) 
Under (A1), (A2), (A3), we have 
E_{z_{1:T−1}}[F(x̄_T) − F(x*)] ≤ (1/(2T)) Σ_{t=2}^T max{ γ/η_t − γ/η_{t−1}, 0 } R² + (1/T) Σ_{t=1}^T (η_t/2) G² + K/T.
Both methods have convergence rate O(1/√T) by letting η_t = η_0 √t for RDA-ADMM and η_t = η_0/√t for OPG-ADMM. 
23 / 49
Convergence rate: strongly convex 
(A4) There exist σ_f, σ_ψ ≥ 0 and P, Q ⪰ O such that 
E_Z[f(Z, x) + (x′ − x)^⊤ ∇_x f(Z, x)] + (σ_f/2) ∥x − x′∥²_P ≤ E_Z[f(Z, x′)],
ψ(y) + (y′ − y)^⊤ ∇ψ(y) + (σ_ψ/2) ∥y − y′∥²_Q ≤ ψ(y′),
for all x, x′ ∈ X and y, y′ ∈ Y, and ∃σ ≥ 0 satisfying σI ⪯ σ_f P + c Q for an explicit constant c > 0 determined by σ_ψ and ρ.
The update rule of RDA-ADMM is modified as 
x_{t+1} = argmin_{x ∈ X} { ḡ_t^⊤ x − λ̄_t^⊤ B⊤x + (ρ/(2t)) ∥B⊤x∥² + ρ(B⊤x̄_t − ȳ_t)^⊤ B⊤x + (1/(2η_t)) ∥x∥²_{G_t} + (σ/2) ∥x − x_t∥² }.
24 / 49
Convergence analysis: strongly convex 
Theorem (Convergence rate of RDA-ADMM) 
Under (A1), (A2), (A3), (A4), we have 
E_{z_{1:T−1}}[F(x̄_T) − F(x*)] ≤ (1/(2T)) Σ_{t=2}^T G²/(σ(t−1) + γ/η_{t−1}) + (ρ/T) ∥x*∥² + K/T.
Theorem (Convergence rate of OPG-ADMM) 
Under (A1), (A2), (A3), (A4), we have 
E_{z_{1:T−1}}[F(x̄_T) − F(x*)] ≤ (1/(2T)) Σ_{t=2}^T max{ γ/η_t − γ/η_{t−1} − σ, 0 } R² + (1/T) Σ_{t=1}^T (η_t/2) G² + K/T.
Both methods have convergence rate O(log(T)/T) by letting η_t = η_0 t for RDA-ADMM and η_t = η_0/t for OPG-ADMM. 
25 / 49
Numerical experiments 
[Figure: Artificial data (1024 dim, 512 samples), overlapped group lasso; expected risk vs. CPU time (s) for OPG-ADMM, RDA-ADMM, NL-ADMM, RDA, OPG.]
[Figure: Real data (Adult, a9a; 15,252 dim, 32,561 samples), overlapped group lasso + ℓ1 reg.; classification error vs. CPU time (s) for OPG-ADMM, RDA-ADMM, NL-ADMM, RDA.]
26 / 49
Outline 
1 Problem settings 
2 Stochastic ADMM for online data 
3 Stochastic ADMM for batch data 
27 / 49
Batch setting 
In the batch setting, the data are fixed. We just minimize the objective function defined by 
(1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(B⊤w).
We construct a method that 
observes few samples per iteration (like online methods), 
converges linearly (unlike online methods). 
28 / 49
Dual problem 
The dual formulation is utilized to obtain our algorithm. 
Primal:  min_w (1/n) Σ_{i=1}^n f_i(z_i^⊤ w) + ψ(B⊤w)
Dual problem 
min_{x ∈ R^n, y ∈ R^d} (1/n) Σ_{i=1}^n f_i*(x_i) + ψ*(y/n)   s.t.   Zx + By = 0,
where Z = [z_1, z_2, …, z_n] ∈ R^{p×n}. 
Optimality condition: 
z_i^⊤ w* ∈ ∇f_i*(x_i*),   (1/n) y* ∈ ∇ψ(u)|_{u=B⊤w*},   Zx* + By* = 0.
⋆ Each coordinate x_i corresponds to each sample z_i. 
29 / 49
Dual of a loss function 
[Figure: logistic loss and its dual (conjugate) function.]
ℓ(y, x) = log(1 + exp(−yx)),   ℓ*(y, u) = (−yu) log(−yu) + (1 + yu) log(1 + yu)   (for 0 ≤ −yu ≤ 1).
30 / 49
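As a sketch (not from the slides) of where this conjugate comes from, the case y = 1 can be worked out directly:

```latex
f(u) = \log(1 + e^{-u}), \qquad
f^*(x) = \sup_u \bigl\{ xu - \log(1 + e^{-u}) \bigr\}.
```

```latex
% Stationarity: x + \frac{e^{-u}}{1 + e^{-u}} = 0
% gives e^{-u} = \frac{-x}{1+x}, i.e. u = \log\frac{1+x}{-x} for x \in (-1, 0); substituting back,
f^*(x) = x \log\frac{1+x}{-x} + \log(1+x) = (-x)\log(-x) + (1+x)\log(1+x),
```

which matches ℓ*(y, u) above after replacing x by yu (for labels y = ±1).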
Dual of regularization functions 
[Figure: ℓ1 and elastic-net regularization and their dual (conjugate) functions.]
31 / 49
Proposed algorithm 
Basic algorithm 
Split the index set {1, …, n} into K groups (I_1, I_2, …, I_K). 
For each t = 1, 2, … 
Choose k ∈ {1, …, K} uniformly at random, and set I = I_k. 
y^{(t)} ← argmin_y { n ψ*(y/n) − ⟨w^{(t−1)}, Zx^{(t−1)} + By⟩ + (ρ/2) ∥Zx^{(t−1)} + By∥² + (1/2) ∥y − y^{(t−1)}∥²_Q },
x_I^{(t)} ← argmin_{x_I} { Σ_{i ∈ I} f_i*(x_i) − ⟨w^{(t−1)}, Z_I x_I + By^{(t)}⟩ + (ρ/2) ∥Z_I x_I + Z_{I^c} x_{I^c}^{(t−1)} + By^{(t)}∥² + (1/2) ∥x_I − x_I^{(t−1)}∥²_{G_{I,I}} },
w^{(t)} ← w^{(t−1)} − ρ{ n(Zx^{(t)} + By^{(t)}) − (n − n/K)(Zx^{(t−1)} + By^{(t−1)}) },
where Q, G are some appropriate positive semidefinite matrices. 
32 / 49
We observe only a few samples at each iteration. 
Mini-batch technique (Takac et al., 2013; Shalev-Shwartz and Zhang, 2013a). 
If K = n, we need to observe only one sample at each iteration. 
33 / 49
Simplified algorithm 
Setting 
Q = ρ(γ_B I_d − B⊤B),   G_{I,I} = ρ(γ_{Z,I} I_{|I|} − Z_I⊤ Z_I),
then, by the relation prox(q | ψ) + prox(q | ψ*) = q, we obtain the following update rule: 
For q^{(t)} = y^{(t−1)} + (1/(ργ_B)) B⊤{ w^{(t−1)} − ρ(Zx^{(t−1)} + By^{(t−1)}) }, let y^{(t)} ← q^{(t)} − prox(q^{(t)} | nψ/(ργ_B));
For p_I^{(t)} = x_I^{(t−1)} + (1/(ργ_{Z,I})) Z_I⊤{ w^{(t−1)} − ρ(Zx^{(t−1)} + By^{(t)}) }, let x_i^{(t)} ← prox(p_i^{(t)} | f_i*/(ργ_{Z,I}))   (∀i ∈ I).
⋆ The update of x can be parallelized. 
34 / 49
Convergence Analysis
x^*: the optimal variable x
Y^*: the set of optimal variables y (not necessarily unique)
w^*: the optimal variable w
Assumption:
  There exists v_f > 0 such that, ∀x_i ∈ R,
    f_i^*(x_i) ≥ f_i^*(x_i^*) + ⟨∇f_i^*(x_i^*), x_i − x_i^*⟩ + (v_f/2)∥x_i − x_i^*∥².
  ∃ h, v_ψ > 0 such that, for all y, u, there exists ŷ ∈ Y^* such that
    ψ^*(y/n) ≥ ψ^*(ŷ/n) + ⟨B⊤w^*, y/n − ŷ/n⟩ + (v_ψ/2)∥P_{Ker(B)}(y/n − ŷ/n)∥²,
    ψ(u) ≥ ψ(B⊤w^*) + ⟨y^*/n, u − B⊤w^*⟩ + (h/2)∥u − B⊤w^*∥².
  B⊤ is injective.
35 / 49
F_D(x, y) := (1/n) Σ_{i=1}^n f_i^*(x_i) + ψ^*(y/n) − ⟨w^*, Z(x/n) + B(y/n)⟩.
R_D(x, y, w) := F_D(x, y) − F_D(x^*, y^*) + (1/(2ρn²))∥w − w^*∥² + ((1 − θ)ρ/(2n))∥Zx + By∥²
               + (1/(2n))∥x − x^*∥²_{v_f I + H} + (1/(2nK))∥y − y^*∥²_Q.

Theorem (Linear convergence of SDCA-ADMM)
Let H be a matrix such that H_{I,I} = Z_I⊤Z_I + G_{I,I} for all I ∈ {I_1, ..., I_K},
  μ = min{ v_f/(4(v_f + λ_max(H))),
           h λ_min(B⊤B)/(2 max{1/n, 4h, 4h λ_max(Q)}),
           (K v_ψ/n)/(4 λ_max(Q)),
           K v_ψ λ_min(BB⊤)/(4 λ_max(Q)(λ_max(Z⊤Z) + 4 v_f)) },
  θ = 1/(4n), and C_1 = R_D(x^{(0)}, y^{(0)}, w^{(0)}); then we have that
  (dual residual)    E[R_D(x^{(t)}, y^{(t)}, w^{(t)})] ≤ (1 − μ/K)^t C_1,
  (primal variable)  E[∥w^{(t)} − w^*∥²] ≤ 2ρn² (1 − μ/K)^t C_1.
For t ≥ C′ K log(C″ n/ϵ), we have that E[∥w^{(t)} − w^*∥²] ≤ ϵ.
36 / 49
Numerical Experiments: Loss function
Binary classification. Smoothed hinge loss:
  f_i(u) = 0                    (y_i u ≥ 1),
           1/2 − y_i u          (y_i u ≤ 0),
           (1/2)(1 − y_i u)²    (otherwise).
(Figure: plot of the smoothed hinge loss.)
⇒ the proximal mapping is analytically obtained:
  prox(u | f_i^*/C) = y_i (C u y_i − 1)/(1 + C)   (−1 ≤ (C u y_i − 1)/(1 + C) ≤ 0),
                      −y_i                        ((C u y_i − 1)/(1 + C) < −1),
                      0                           (otherwise).
37 / 49
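A quick numerical check of this closed form (a sketch; the brute-force grid search and the test points are mine, and f_i^* below is written directly as the conjugate of the smoothed hinge, which is s + s²/2 on the segment s = y_i x ∈ [−1, 0]):

import numpy as np

def prox_smoothed_hinge_conj(u, y, C):
    # closed form above for prox(u | f_i^*/C), smoothed hinge with label y in {-1, +1}
    a = (C * u * y - 1.0) / (1.0 + C)
    if a < -1.0:
        return -y
    if a > 0.0:
        return 0.0
    return y * a

def prox_brute(u, y, C):
    # brute force: minimize f_i^*(x)/C + (x - u)^2 / 2 over the feasible segment y*x in [-1, 0]
    s = np.linspace(-1.0, 0.0, 200001)
    x = y * s
    obj = (s + 0.5 * s ** 2) / C + 0.5 * (x - u) ** 2
    return x[np.argmin(obj)]

for (u, y, C) in [(0.3, 1.0, 2.0), (1.0, 1.0, 2.0), (-2.0, 1.0, 0.5), (1.5, -1.0, 1.0)]:
    print(prox_smoothed_hinge_conj(u, y, C), prox_brute(u, y, C))   # pairs agree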
Artificial data: Overlapped group regularization:
  ψ̃(X) = C( Σ_{i=1}^{32} ∥X_{:,i}∥ + Σ_{j=1}^{32} ∥X_{j,:}∥ + 0.01 Σ_{i,j} X_{i,j}²/2 ),  X ∈ R^{32×32}.
38 / 49
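A direct transcription of this regularizer (a sketch; the function name is mine, X is any 32×32 array, and C is left as a parameter):

import numpy as np

def overlapped_group_reg(X, C):
    # psi_tilde(X) = C * ( sum_i ||X[:, i]|| + sum_j ||X[j, :]|| + 0.01 * sum_{i,j} X[i,j]^2 / 2 )
    col_norms = np.linalg.norm(X, axis=0).sum()   # one group per column
    row_norms = np.linalg.norm(X, axis=1).sum()   # one group per row (overlapping the columns)
    ridge = 0.01 * 0.5 * np.sum(X ** 2)
    return C * (col_norms + row_norms + ridge)

X = np.random.randn(32, 32)
print(overlapped_group_reg(X, C=1.0))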
Numerical Experiments (Artificial data): Results
Figure: Artificial data (n = 5,120), overlapped group lasso, mini-batch size n/K = 50. Panels (vs. CPU time (s)): (a) excess objective value, (b) test loss, (c) classification error; methods: RDA, OPG-ADMM, RDA-ADMM, OL-ADMM, Batch-ADMM, SDCA-ADMM.
39 / 49
Numerical Experiments (Artificial data): Results
Figure: Artificial data (n = 51,200), overlapped group lasso, mini-batch size n/K = 50. Panels (vs. CPU time (s)): (a) excess objective value, (b) test loss, (c) classification error; methods: RDA, OPG-ADMM, RDA-ADMM, OL-ADMM, Batch-ADMM, SDCA-ADMM.
40 / 49
Numerical Experiments (Real data): Setting
Real data: Graph regularization:
  ψ̃(w) = C_1 Σ_{i=1}^p |w_i| + C_2 Σ_{(i,j)∈E} |w_i − w_j| + 0.01( C_1 Σ_{i=1}^p |w_i|² + C_2 Σ_{(i,j)∈E} |w_i − w_j|² ),
where E is the set of edges obtained from the similarity matrix. The similarity graph is obtained by Graphical Lasso (Yuan and Lin, 2007; Banerjee et al., 2008).
41 / 49
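A direct transcription of this regularizer (a sketch; the edge list E and the constants below are placeholders, and the Graphical-Lasso step that produces E is not shown):

import numpy as np

def graph_reg(w, E, C1, C2):
    # psi_tilde(w) = C1 * sum_i |w_i| + C2 * sum_{(i,j) in E} |w_i - w_j|
    #              + 0.01 * ( C1 * sum_i w_i^2 + C2 * sum_{(i,j) in E} (w_i - w_j)^2 )
    i, j = np.array(E).T                      # E given as a list of (i, j) index pairs
    diff = w[i] - w[j]
    l1_part = C1 * np.abs(w).sum() + C2 * np.abs(diff).sum()
    l2_part = 0.01 * (C1 * np.sum(w ** 2) + C2 * np.sum(diff ** 2))
    return l1_part + l2_part

w = np.random.randn(5)
E = [(0, 1), (1, 2), (3, 4)]
print(graph_reg(w, E, C1=0.1, C2=0.05))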
Numerical Experiments (Real data): Results
Figure: Real data (20 Newsgroups, n = 12,995), graph regularization, mini-batch size n/K = 50. Panels (vs. CPU time (s)): (a) objective function, (b) test loss, (c) classification error; methods: OPG-ADMM, RDA-ADMM, OL-ADMM, SDCA-ADMM.
42 / 49
Numerical Experiments (Real data): Results
Figure: Real data (a9a, n = 32,561), graph regularization, mini-batch size n/K = 50. Panels (vs. CPU time (s)): (a) objective function, (b) test loss, (c) classification error; methods: OPG-ADMM, RDA-ADMM, OL-ADMM, SDCA-ADMM.
43 / 49
Summary
New stochastic optimization methods based on ADMM are proposed.
  Applicable to a wide range of regularization functions.
  Easy implementation.
Online setting: RDA-ADMM, OPG-ADMM
  O(1/√T) in general, and O(log(T)/T) for a strongly convex objective.
Batch setting: SDCA-ADMM (Stochastic Dual Coordinate Ascent ADMM)
  Linear convergence with strong convexity around the optimum.
  The convergence rate is controlled by the number of groups K (mini-batch size n/K).
44 / 49
References I
O. Banerjee, L. E. Ghaoui, and A. d'Aspremont. Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9:485–516, 2008.
J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10:2873–2908, 2009.
D. Gabay and B. Mercier. A dual algorithm for the solution of nonlinear variational problems via finite-element approximations. Computers & Mathematics with Applications, 2:17–40, 1976.
M. Hestenes. Multiplier and gradient methods. Journal of Optimization Theory and Applications, 4:303–320, 1969.
45 / 49

References II
R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 315–323. Curran Associates, Inc., 2013. URL http://papers.nips.cc/paper/4937-accelerating-stochastic-gradient-descent-using-predictive-variance-reduction.pdf.
N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for strongly-convex optimization with finite training sets. In Advances in Neural Information Processing Systems 25, 2013.
Y. Nesterov. Gradient methods for minimizing composite objective function. Technical Report 76, Center for Operations Research and Econometrics (CORE), Catholic University of Louvain (UCL), 2007.
46 / 49

References III
Y. Nesterov. Primal-dual subgradient methods for convex problems. Mathematical Programming, Series B, 120:221–259, 2009.
H. Ouyang, N. He, L. Q. Tran, and A. Gray. Stochastic alternating direction method of multipliers. In Proceedings of the 30th International Conference on Machine Learning, 2013.
M. Powell. A method for nonlinear constraints in minimization problems. In R. Fletcher, editor, Optimization, pages 283–298. Academic Press, London, New York, 1969.
R. T. Rockafellar. Augmented Lagrangians and applications of the proximal point algorithm in convex programming. Mathematics of Operations Research, 1:97–116, 1976.
S. Shalev-Shwartz and T. Zhang. Accelerated mini-batch stochastic dual coordinate ascent. In Advances in Neural Information Processing Systems 26, 2013a.
47 / 49

References IV
S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. Technical report, 2013b. arXiv:1211.2717.
T. Suzuki. Dual averaging and proximal gradient descent for online alternating direction multiplier method. In Proceedings of the 30th International Conference on Machine Learning, pages 392–400, 2013.
T. Suzuki. Stochastic dual coordinate ascent with alternating direction method of multipliers. In Proceedings of the 31st International Conference on Machine Learning, pages 736–744, 2014.
M. Takac, A. Bijral, P. Richtarik, and N. Srebro. Mini-batch primal and dual methods for SVMs. In Proceedings of the 30th International Conference on Machine Learning, 2013.
H. Wang and A. Banerjee. Online alternating direction method. In Proceedings of the 29th International Conference on Machine Learning, 2012.
48 / 49

References V
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. In Advances in Neural Information Processing Systems 23, 2009.
M. Yuan and Y. Lin. Model selection and estimation in the Gaussian graphical model. Biometrika, 94(1):19–35, 2007.
49 / 49