Modeling the Dynamics of SGD by Stochastic
Differential Equation
Outline
• Stochastic Gradient Descent (SGD)
• Stochastic Differential Equation (SDE)
• Continuous-time SGD & Controlled SGD
• Effects of SGD on Generalization
2
Outline
• Stochastic Gradient Descent (SGD)
• Stochastic Differential Equation (SDE)
• Continuous-time SGD & Controlled SGD
• Effects of SGD on Generalization
3
Stochastic Gradient Descent (SGD)
• Gradient Descent vs. Stochastic Gradient Descent
[Figure: trajectories toward the minimum for gradient descent (batch size = dataset size) and for SGD with a small batch size and with a large batch size]
w : weights, η : learning rate, B : batch, L(x, w) : loss function
$w \leftarrow w - \eta \frac{1}{|B|} \sum_{x \in B} \frac{\partial L(x, w)}{\partial w}$
4
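As a concrete reference for the update rule above, here is a minimal NumPy sketch of one mini-batch SGD step; the least-squares loss, the synthetic dataset, and the hyper-parameters are illustrative assumptions, not something taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: L(x, w) = 0.5 * (w . x - y)^2 per sample (x, y).
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

def sgd_step(w, eta, batch_idx):
    """One update: w <- w - eta * (1/|B|) * sum_{x in B} dL(x, w)/dw."""
    Xb, yb = X[batch_idx], y[batch_idx]
    grad = Xb.T @ (Xb @ w - yb) / len(batch_idx)   # gradient averaged over the batch
    return w - eta * grad

w = np.zeros(5)
eta, batch_size = 0.1, 32
for _ in range(500):
    batch_idx = rng.choice(len(X), size=batch_size, replace=False)
    w = sgd_step(w, eta, batch_idx)
print("learned weights:", np.round(w, 2))
```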
Stochastic Gradient Descent (SGD)
• Convergence of SGD
• Assume that the loss function is convex
$E[L(x, \bar{w}) - L(x, w^*)] \leq O(1/\sqrt{T})$
[Figure: SGD trajectory ending at $\bar{w}$ near the minimum $w^*$; the distance is guaranteed to be small]
T : number of steps
$\bar{w}$ : w after T steps
$w^*$ : w at the minimum of L
5
Stochastic Gradient Descent (SGD)
• Dynamics of SGD
[Figure: SGD trajectory wandering from the starting point toward the minimum]
The dynamics of SGD are the process between the starting point and the final solution.
6
Outline
• Stochastic Gradient Descent (SGD)
• Stochastic Differential Equation (SDE)
• Continuous-time SGD & Controlled SGD
• Effects of SGD on Generalization
7
Random Walk
[Figure: a particle starts at position 0 at t = 0 and moves to +Δx or -Δx at t = Δt, each with probability 1/2]
The position of the particle at t = Δt is a random variable $X_{\Delta t}$ such that
$P(X_{\Delta t} = \Delta x) = \tfrac{1}{2}, \qquad P(X_{\Delta t} = -\Delta x) = \tfrac{1}{2}$
8
Random Walk
Distribution of the position after several steps:
$t = \Delta t$ : $P(X_{\Delta t} = \pm\Delta x) = \tfrac{1}{2}$
$t = 2\Delta t$ : $P(X_{2\Delta t} = \pm 2\Delta x) = \tfrac{1}{4}, \quad P(X_{2\Delta t} = 0) = \tfrac{1}{2}$
$t = 3\Delta t$ : $P(X_{3\Delta t} = \pm 3\Delta x) = \tfrac{1}{8}, \quad P(X_{3\Delta t} = \pm\Delta x) = \tfrac{3}{8}$
9
Random Walk
Let $t = n\Delta t$ with $n \to \infty$ and $\Delta t \to 0$. The distribution of the position approaches a normal distribution:
$X_t \sim N(0, Dt)$, where $D = \frac{(\Delta x)^2}{\Delta t}$ is the diffusion coefficient.
10
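A quick simulation makes the diffusion limit concrete: sample many independent random walks and compare the empirical variance of $X_t$ with $Dt = \frac{(\Delta x)^2}{\Delta t} t$. The step sizes and the horizon below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
dt, dx = 1e-3, 1e-1            # time step and space step of the walk
D = dx**2 / dt                 # diffusion coefficient
t = 1.0
n = int(t / dt)                # number of steps so that n * dt = t

# 100000 independent walks; each step is +dx or -dx with probability 1/2.
steps = rng.choice([-dx, dx], size=(100_000, n))
X_t = steps.sum(axis=1)

print("empirical mean:", X_t.mean())   # close to 0
print("empirical var :", X_t.var())    # close to D * t
print("D * t         :", D * t)
```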
Stochastic Differential Equation (SDE)
• Ordinary Differential Equation
$\frac{dx(t)}{dt} = b(x(t)) \ \text{ for } t > 0, \qquad x(0) = x_0$
[Figure: a single deterministic trajectory of x starting from x_0]
11
Stochastic Differential Equation (SDE)
• Stochastic Differential Equation
$\frac{dx(t)}{dt} = b(x(t)) + B(x(t))\frac{dW(t)}{dt} \ \text{ for } t > 0, \qquad x(0) = x_0$
where W(t) is a Wiener process (a random walk with infinitely small steps).
b(x(t)) is the deterministic part; B(x(t)) dW(t)/dt is the stochastic part.
[Figure: several trajectory samples of x starting from x_0]
12
Stochastic Differential Equation (SDE)
• Solving a Stochastic Differential Equation
$\frac{dx(t)}{dt} = b(x(t)) + B(x(t))\frac{dW(t)}{dt}, \qquad x(0) = x_0$
Multiply both sides by dt:
$dx(t) = b(x(t))\,dt + B(x(t))\,dW(t), \qquad x(0) = x_0$
Integrate both sides from 0 to t:
$x(t) = x_0 + \int_0^t b(x(s))\,ds + \int_0^t B(x(s))\,dW(s)$
The last term is a stochastic integral.
13
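Numerically, such an SDE is usually simulated rather than solved in closed form; the Euler-Maruyama scheme below is a minimal sketch that discretizes dx = b(x) dt + B(x) dW. The drift b, the diffusion B, and the step size are placeholder choices for illustration.

```python
import numpy as np

def euler_maruyama(b, B, x0, T, dt, rng):
    """Simulate dx = b(x) dt + B(x) dW on [0, T] with step size dt."""
    n = int(T / dt)
    x = np.empty(n + 1)
    x[0] = x0
    for k in range(n):
        dW = rng.normal(scale=np.sqrt(dt))      # Wiener increment ~ N(0, dt)
        x[k + 1] = x[k] + b(x[k]) * dt + B(x[k]) * dW
    return x

rng = np.random.default_rng(0)
# Example: dx = -x dt + 0.5 dW (an Ornstein-Uhlenbeck process).
path = euler_maruyama(b=lambda x: -x, B=lambda x: 0.5, x0=2.0, T=5.0, dt=1e-3, rng=rng)
print("x(T) for one sample path:", path[-1])
```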
Stochastic Differential Equation (SDE)
• The Solution of a Stochastic Integral is a Random Variable
If $g : [0, 1] \to \mathbb{R}$ is a deterministic function:
$E\Big[\int_0^1 g\,dW\Big] = 0$ (mean), and $E\Big[\Big(\int_0^1 g\,dW\Big)^2\Big] = \int_0^1 g^2\,dt$ (variance)
If $G$ is a stochastic process such that $E\big[\int_0^T G^2\,dt\big] < \infty$:
$E\Big[\int_0^T G\,dW\Big] = 0$ (mean), and $E\Big[\Big(\int_0^T G\,dW\Big)^2\Big] = E\Big[\int_0^T G^2\,dt\Big]$ (variance)
14
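These mean and variance formulas can be checked by Monte Carlo for a deterministic integrand. The sketch below uses g(t) = t on [0, 1] (an assumed example), so the variance should come out near $\int_0^1 t^2\,dt = 1/3$.

```python
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_paths = 1_000, 200_000
dt = 1.0 / n_steps
t = np.arange(n_steps) * dt
g = t                                    # deterministic integrand g(t) = t

# Approximate the stochastic integral int_0^1 g dW by sum_k g(t_k) dW_k for many paths.
dW = rng.normal(scale=np.sqrt(dt), size=(n_paths, n_steps))
I = dW @ g

print("empirical mean    :", I.mean())   # close to 0
print("empirical variance:", I.var())    # close to 1/3
print("int_0^1 g^2 dt    :", 1.0 / 3.0)
```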
Outline
• Stochastic Gradient Descent (SGD)
• Stochastic Differential Equation (SDE)
• Continuous-time SGD & Controlled SGD
• Effects of SGD on Generalization
15
Continuous-time SGD & Controlled SGD
16
Continuous-time SGD & Controlled SGD
• Notation Conventions:
Gradient Descent : $x_{k+1} = x_k - \eta \nabla f(x_k)$
Stochastic Gradient Descent : $x_{k+1} = x_k - \eta \nabla f_{\gamma_k}(x_k)$
f : loss function
$x_k$ : weights at step k
$\gamma_k$ : index of the training sample at step k (assume batch size is 1)
$f_i$ : loss function calculated by batch i, where $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$
17
Continuous-time SGD & Controlled SGD
$x_{k+1} - x_k = -\eta \nabla f_{\gamma_k}(x_k)$
$x_{k+1} - x_k = -\eta \nabla f(x_k) + \sqrt{\eta}\,V_k$
The term $-\eta \nabla f(x_k)$ is the deterministic part (descent toward the minimum); $\sqrt{\eta}\,V_k$ is the stochastic part, where
$V_k = \sqrt{\eta}\,\big(\nabla f(x_k) - \nabla f_{\gamma_k}(x_k)\big)$
mean of $V_k$ : 0
covariance of $V_k$ : $\eta\Sigma(x_k)$,
where $\Sigma(x_k) = \frac{1}{n}\sum_{i=1}^{n} \big(\nabla f(x_k) - \nabla f_i(x_k)\big)\big(\nabla f(x_k) - \nabla f_i(x_k)\big)^T$
18
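For intuition, $\Sigma(x)$ can be computed directly from per-sample gradients. The sketch below does this for the two-sample toy problem that appears two slides later ($f_1(x) = (x-1)^2 - 1$, $f_2(x) = (x+1)^2 - 1$); in this 1-D case the covariance is a scalar and equals 4 everywhere.

```python
import numpy as np

# Per-sample gradients of the toy losses f_1(x) = (x-1)^2 - 1 and f_2(x) = (x+1)^2 - 1.
grads = [lambda x: 2 * (x - 1), lambda x: 2 * (x + 1)]

def sigma(x):
    """Sigma(x) = (1/n) sum_i (grad f(x) - grad f_i(x)) (grad f(x) - grad f_i(x))^T."""
    g_i = np.array([g(x) for g in grads])
    g_mean = g_i.mean()                  # full-batch gradient grad f(x)
    return np.mean((g_i - g_mean) ** 2)  # the outer product reduces to a square in 1-D

for x in (-1.0, 0.0, 2.0):
    print(f"Sigma({x}) = {sigma(x)}")    # 4.0 for every x in this example
```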
Continuous-time SGD & Controlled SGD
• Continuous-time SGD
$x_{k+1} - x_k = -\eta \nabla f(x_k) + \sqrt{\eta}\,V_k$
Convert to the continuous-time domain:
$dX_t = -\nabla f(X_t)\,dt + \sqrt{\eta\Sigma(X_t)}\,dW_t$ (order-1 weak approximation)
$dX_t = -\nabla\Big(f(X_t) + \frac{\eta}{4}|\nabla f(X_t)|^2\Big)dt + \sqrt{\eta\Sigma(X_t)}\,dW_t$ (order-2 weak approximation)
19
Continuous-time SGD & Controlled SGD
• A Toy Example
$f(x) = x^2, \quad f_1(x) = (x-1)^2 - 1, \quad f_2(x) = (x+1)^2 - 1$
• Continuous-time SGD (order-2 weak approximation):
$dX_t = -2(1+\eta)X_t\,dt + 2\sqrt{\eta}\,dW_t$
• Solution:
$X_t \sim N\Big(x_0 e^{-2(1+\eta)t}, \ \frac{\eta}{1+\eta}\big(1 - e^{-4(1+\eta)t}\big)\Big)$
20
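The closed-form solution can be verified by simulating the SDE with Euler-Maruyama and comparing the empirical mean and variance of $X_t$ with the formula above; $x_0$, η, t, the step size, and the number of paths below are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, x0, T, dt, n_paths = 0.1, 2.0, 1.0, 1e-3, 100_000

# Euler-Maruyama for dX = -2(1+eta) X dt + 2 sqrt(eta) dW, vectorized over paths.
X = np.full(n_paths, x0)
for _ in range(int(T / dt)):
    dW = rng.normal(scale=np.sqrt(dt), size=n_paths)
    X += -2 * (1 + eta) * X * dt + 2 * np.sqrt(eta) * dW

mean_theory = x0 * np.exp(-2 * (1 + eta) * T)
var_theory = eta / (1 + eta) * (1 - np.exp(-4 * (1 + eta) * T))
print("empirical mean:", X.mean(), "  theory:", mean_theory)
print("empirical var :", X.var(),  "  theory:", var_theory)
```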
Continuous-time SGD & Controlled SGD
$X_t \sim N\Big(x_0 e^{-2(1+\eta)t}, \ \frac{\eta}{1+\eta}\big(1 - e^{-4(1+\eta)t}\big)\Big)$
$E[X_t] = \begin{cases} x_0 & \text{when } t = 0 \\ 0 & \text{when } t \to \infty \end{cases}$
$\mathrm{Var}[X_t] = \begin{cases} 0 & \text{when } t = 0 \\ \frac{\eta}{1+\eta} & \text{when } t \to \infty \end{cases}$
The descent phase ends at $t^*$, where $E[X_{t^*}] = \sqrt{\mathrm{Var}[X_{t^*}]}$; after $t^*$ the dynamics are in the fluctuations phase, with fluctuations of size $\sqrt{\frac{\eta}{1+\eta}}$.
[Figure: x versus t, showing an exponential descent from $x_0$ followed by fluctuations around 0]
21
Continuous-time SGD & Controlled SGD
• Controlled SGD : Adaptive Hyper-parameter Adjustment
$x_{k+1} = x_k - \eta u_k f'(x_k)$, where $u_k \in [0, 1]$ is the adjustment factor
Optimal Control Formulation
$\min_{u_t} E\,f(X_t)$ subject to:
$dX_t = -u_t f'(X_t)\,dt + u_t\sqrt{\eta\Sigma(X_t)}\,dW_t$
22
Continuous-time SGD & Controlled SGD
• Quadratic Objective Function
$f(x) = \frac{1}{2}a(x - b)^2$, assume the covariance of $f_i'$ is $\eta\Sigma(x)$
• Continuous-time SGD:
$dX_t = -a u_t (X_t - b)\,dt + u_t\sqrt{\eta\Sigma}\,dW_t$
• Optimal control policy:
$u_t^* = \begin{cases} 1 & \text{if } a \le 0 \text{ or } t \le t^* \\ \frac{1}{1 + a(t - t^*)} & \text{if } a > 0 \text{ and } t > t^* \end{cases}$
23
Continuous-time SGD & Controlled SGD
• Optimal control policy
$f(x) = \frac{1}{2}a(x - b)^2$, assume the covariance of $f_i'$ is $\eta\Sigma(x)$
$u_t^* = \begin{cases} 1 & \text{if } a \le 0 \text{ or } t \le t^* \quad (t \le t^* \text{ is the descent phase}) \\ \frac{1}{1 + a(t - t^*)} & \text{if } a > 0 \text{ and } t > t^* \quad (t > t^* \text{ is the fluctuations phase}) \end{cases}$
[Figure: x versus t for a ≤ 0 and for a > 0, with the descent phase up to t* and the fluctuations phase afterward]
24
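To see what the policy buys, the sketch below discretizes the controlled dynamics $dX_t = -a u_t (X_t - b)\,dt + u_t\sqrt{\eta\Sigma}\,dW_t$ and compares $E[f(X_T)]$ for a constant $u_t = 1$ against the annealing policy $u_t^*$; the values of a, b, Σ, η, the choice of $t^*$, and the discretization are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, Sigma, eta = 1.0, 0.0, 1.0, 0.05
x0, T, dt, n_paths = 2.0, 20.0, 1e-2, 50_000

# Treat the end of the descent phase as the time when the deterministic decay
# |x0 - b| * exp(-a t) reaches the stationary noise level sqrt(eta * Sigma / (2 a)).
t_star = np.log(abs(x0 - b) / np.sqrt(eta * Sigma / (2 * a))) / a

def u_optimal(t):
    """u = 1 during the descent phase, then 1 / (1 + a (t - t*)) (case a > 0)."""
    return 1.0 if t <= t_star else 1.0 / (1.0 + a * (t - t_star))

def expected_loss(policy):
    X = np.full(n_paths, x0)
    for k in range(int(T / dt)):
        u = policy(k * dt)
        dW = rng.normal(scale=np.sqrt(dt), size=n_paths)
        X += -a * u * (X - b) * dt + u * np.sqrt(eta * Sigma) * dW
    return (0.5 * a * (X - b) ** 2).mean()     # E[f(X_T)]

print("E[f(X_T)] with constant u = 1 :", expected_loss(lambda t: 1.0))
print("E[f(X_T)] with the policy u*  :", expected_loss(u_optimal))
```

The constant-u run settles at the fluctuation level ηΣ/4, while the annealed run keeps shrinking it, which is the point of the control policy.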
Continuous-time SGD & Controlled SGD
• General Objective Function
$f(x)$ and $f_i(x)$ are not necessarily quadratic, and $x \in \mathbb{R}^d$.
Assume $f(x) \approx \frac{1}{2}\sum_{i=1}^{d} a_{(i)}\big(x_{(i)} - b_{(i)}\big)^2$ holds locally in x, and
$\Sigma \approx \mathrm{diag}\{\Sigma_{(1)}, \dots, \Sigma_{(d)}\}$ where each $\Sigma_{(i)}$ is locally constant
(each dimension is independent).
25
Continuous-time SGD & Controlled SGD
• Controlled SGD Algorithms
At each step k, estimate $a_{k,(i)}$, $b_{k,(i)}$ for the local quadratic $\frac{1}{2}a_{k,(i)}\big(x_{k,(i)} - b_{k,(i)}\big)^2$.
Since $\nabla f_{(i)} \approx a_{(i)}\big(x_{(i)} - b_{(i)}\big)$, we use linear regression to estimate $a_{k,(i)}$, $b_{k,(i)}$:
$a_{k,(i)} = \dfrac{\overline{gx}_{k,(i)} - \bar{g}_{k,(i)}\,\bar{x}_{k,(i)}}{\overline{x^2}_{k,(i)} - \bar{x}_{k,(i)}^2}$, and $b_{k,(i)} = \bar{x}_{k,(i)} - \dfrac{\bar{g}_{k,(i)}}{a_{k,(i)}}$
where $g_{k,(i)} = \nabla f_{\gamma_k}(x_k)_{(i)}$, and the barred quantities are exponential moving averages, e.g. $\bar{g}_{k+1,(i)} = \beta_{k,(i)}\,\bar{g}_{k,(i)} + (1 - \beta_{k,(i)})\,g_{k,(i)}$.
26
Continuous-time SGD & Controlled SGD
• Controlled SGD Algorithms
Solve the optimal control policy $u^*_{k,(i)}$ for $\frac{1}{2}a_{k,(i)}\big(x_{k,(i)} - b_{k,(i)}\big)^2$:
$u^*_{k,(i)} = \begin{cases} 1 & \text{if } a_{k,(i)} \le 0, \\ \min\!\Big(1, \ \dfrac{a_{k,(i)}\big(\bar{x}_{k,(i)} - b_{k,(i)}\big)^2}{\eta\,\Sigma_{k,(i)}}\Big) & \text{if } a_{k,(i)} > 0, \end{cases}$
where $\Sigma_{k,(i)} = \overline{g^2}_{k,(i)} - \bar{g}_{k,(i)}^2$
27
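Below is a minimal per-coordinate sketch of these updates, using exponential moving averages for the barred quantities. It only shows the bookkeeping on an assumed toy problem (per-sample losses $f_i(x) = \frac{1}{2}\|x - c_i\|^2$); the EMA decay, the epsilon guards, and the hyper-parameters are assumptions, and the authors' reference implementation is linked on the next slide.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eta, beta, eps = 5, 100, 0.1, 0.9, 1e-8
centers = rng.normal(size=(n, d))        # f_i(x) = 0.5 * ||x - c_i||^2, minimum at mean(c_i)

x = 5.0 * rng.normal(size=d)
# Exponential moving averages of x, g, x*g, x^2 and g^2, per coordinate.
m_x, m_g, m_xg, m_x2, m_g2 = x.copy(), np.zeros(d), np.zeros(d), x**2, np.zeros(d)

for _ in range(2000):
    g = x - centers[rng.integers(n)]                     # stochastic gradient (batch size 1)
    for m, v in ((m_x, x), (m_g, g), (m_xg, x * g), (m_x2, x**2), (m_g2, g**2)):
        m *= beta
        m += (1 - beta) * v                              # EMA update in place

    a = (m_xg - m_g * m_x) / (m_x2 - m_x**2 + eps)       # regression slope, estimates a_(i)
    b = m_x - m_g / (a + eps)                            # regression intercept, estimates b_(i)
    Sigma = np.maximum(m_g2 - m_g**2, eps)               # per-coordinate gradient variance
    u = np.where(a <= 0, 1.0, np.minimum(1.0, a * (m_x - b) ** 2 / (eta * Sigma)))

    x -= eta * u * g                                     # controlled SGD step

print("distance to the optimum:", np.linalg.norm(x - centers.mean(axis=0)))
```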
[Figure: experimental results on MNIST (fully connected NN), CIFAR-10 (fully connected NN), and CIFAR-10 (CNN)]
28
Continuous-time SGD & Controlled SGD
• Implementation of cSGD
• https://github.com/LiQianxiao/cSGD-cMSGD
29
Outline
• Stochastic Gradient Descent (SGD)
• Stochastic Differential Equation (SDE)
• Continuous-time SGD & Controlled SGD
• Effects of SGD on Generalization
30
Effects of SGD on Generalization
ICANN 2018
31
Effects of SGD on Generalization
• Notation Conventions:
Loss function : $L(\theta) = \frac{1}{N}\sum_{n=1}^{N} l(\theta, x_n)$, where N is the size of the dataset
GD : $\theta_{k+1} = \theta_k - \eta\,g(\theta_k)$, where $g(\theta) = \frac{\partial L}{\partial \theta}$
SGD : $\theta_{k+1} = \theta_k - \eta\,g^{(S)}(\theta_k)$, where $g^{(S)}(\theta) = \frac{1}{S}\sum_{n \in B} \frac{\partial}{\partial\theta} l(\theta, x_n)$,
B is the batch and S is the batch size
32
Effects of SGD on Generalization
• Continuous-time SGD
$d\theta = -g(\theta)\,dt + \sqrt{\frac{\eta}{S}}\,R(\theta)\,dW(t)$,
where $R(\theta)R(\theta)^T = C(\theta)$ and $\frac{C(\theta)}{S}$ is the covariance of $\big(g^{(S)}(\theta) - g(\theta)\big)$
33
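The η/S noise scale can be observed directly by running plain SGD on a one-dimensional quadratic with different learning rates and batch sizes and measuring the stationary variance of θ; the per-sample loss $l(\theta, x_n) = \frac{1}{2}(\theta - x_n)^2$ with Gaussian $x_n$ is an illustrative assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=100_000)     # samples x_n for l(theta, x_n) = 0.5 * (theta - x_n)^2

def stationary_var(eta, S, n_steps=100_000):
    """Run SGD past its transient and return the variance of theta in steady state."""
    theta, trace = 0.0, []
    for k in range(n_steps):
        batch = data[rng.integers(len(data), size=S)]
        theta -= eta * (theta - batch).mean()            # SGD step on the quadratic loss
        if k > n_steps // 2:
            trace.append(theta)                          # keep only the stationary half
    return np.var(trace)

# Two settings with the same eta/S give similar variance; a larger eta/S gives more.
for eta, S in [(0.01, 10), (0.04, 40), (0.04, 10)]:
    print(f"eta={eta:.2f}  S={S:2d}  eta/S={eta/S:.4f}  Var[theta]={stationary_var(eta, S):.5f}")
```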
Effects of SGD on Generalization
• Effects of different learning rates and batch sizes
$d\theta = -g(\theta)\,dt + \sqrt{\frac{\eta}{S}}\,R(\theta)\,dW(t)$,
where $R(\theta)R(\theta)^T = C(\theta)$ and $C(\theta)$ is the covariance of $g(\theta)$
[Figure: trajectories around the minimum for small versus large η, and for small versus large S]
34
Effects of SGD on Generalization
• Flat minimum vs. Sharp minimum (https://arxiv.org/abs/1609.04836)
[Figure: training loss and the loss function evaluated on testing data; the flat minimum keeps a low testing error, while the sharp minimum has a high testing error]
35
Effects of SGD on Generalization
• Effects of learning rate / batch size on generalization
$d\theta = -g(\theta)\,dt + \sqrt{\frac{\eta}{S}}\,R(\theta)\,dW(t)$,
where $R(\theta)R(\theta)^T = C(\theta)$ and $C(\theta)$ is the covariance of $g(\theta)$
[Figure: with small $\sqrt{\eta/S}$ (small η, large S) the iterates can settle in either a flat or a sharp minimum; with large $\sqrt{\eta/S}$ (large η, small S) they are pushed out of sharp minima toward flat minima]
36
Effects of SGD on Generalization
• Theoretical Explanation
• Assumption 1 : The loss surface can be approximated by a quadratic bowl, with the
minimum at zero loss.
• Assumption 2 : The covariance of the gradients and the Hessian of the loss
approximation are approximately equal, i.e. C = H.
37
Effects of SGD on Generalization
• Theoretical Explanation
$d\theta = -g(\theta)\,dt + \sqrt{\frac{\eta}{S}}\,R(\theta)\,dW(t)$,
where $R(\theta)R(\theta)^T = C(\theta)$ and $C(\theta)$ is the covariance of $g(\theta)$
$dz = -\Lambda z\,dt + \sqrt{\frac{\eta}{S}}\,\sqrt{\Lambda}\,dW(t)$
Change of variables:
z : new variable, where $z = V^T(\theta - \theta^*)$
$\theta^*$ : the parameters at the minimum
V : orthogonal matrix of the eigendecomposition $H = V\Lambda V^T$
H : the Hessian of $L(\theta)$
38
Effects of SGD on Generalization
• Theoretical Explanation
• Expectation of the loss function
$dz = -\Lambda z\,dt + \sqrt{\frac{\eta}{S}}\,\sqrt{\Lambda}\,dW(t)$
This is an Ornstein-Uhlenbeck process for z, with solution: $E[z] = 0$ and $\mathrm{cov}[z] = \frac{\eta}{2S}I$
$E(L) = \frac{1}{2}\sum_{i=1}^{q} \lambda_i E(z_i^2) = \frac{\eta}{4S}\mathrm{Tr}(\Lambda) = \frac{\eta}{4S}\mathrm{Tr}(H)$
39
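Both the stationary covariance η/(2S)·I and the resulting E(L) = η/(4S)·Tr(H) can be checked numerically under the assumption C = H; the eigenvalues of H, the values of η and S, and the discretization below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
eta, S = 0.1, 16
lam = np.array([0.5, 1.0, 4.0])               # eigenvalues of H, so Tr(H) = 5.5
dt, n_steps, n_paths = 1e-3, 20_000, 2_000

# Simulate dz = -Lambda z dt + sqrt(eta/S) sqrt(Lambda) dW for many independent paths.
z = np.zeros((n_paths, lam.size))
for _ in range(n_steps):
    dW = rng.normal(scale=np.sqrt(dt), size=z.shape)
    z += -lam * z * dt + np.sqrt(eta / S) * np.sqrt(lam) * dW

print("empirical Var[z_i]:", z.var(axis=0))   # each close to eta / (2 S)
print("eta / (2 S)       :", eta / (2 * S))
print("empirical E(L)    :", 0.5 * (lam * z.var(axis=0)).sum())
print("eta/(4S) * Tr(H)  :", eta / (4 * S) * lam.sum())
```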
• Theoretical Explanation
Effects of SGD on Generalization
$E(L) = \frac{1}{2}\sum_{i=1}^{q} \lambda_i E(z_i^2) = \frac{\eta}{4S}\mathrm{Tr}(\Lambda) = \frac{\eta}{4S}\mathrm{Tr}(H) \ \Longrightarrow \ \frac{E(L)}{\mathrm{Tr}(H)} \propto \frac{\eta}{S}$
For minima with similar loss values E(L):
Sharp minimum (large Tr(H)) : low $\frac{E(L)}{\mathrm{Tr}(H)}$, i.e. low $\frac{\eta}{S}$
Flat minimum (small Tr(H)) : high $\frac{E(L)}{\mathrm{Tr}(H)}$, i.e. high $\frac{\eta}{S}$
40
[Figure: experimental results; increasing the LR/BS ratio increases accuracy, and runs with a similar LR/BS ratio reach similar accuracy]
41
Tips for Tuning Batch Size and Learning Rate
• The learning rate can be decayed as the number of epochs increases.
• The learning rate should not be initialized to a small value.
• To keep the validation accuracy unchanged, the LR/BS ratio should remain constant
when changing the batch size (see the sketch after this slide).
• To achieve higher validation accuracy, increase the learning rate or reduce the
batch size.
42
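The third tip is essentially the linear scaling rule: when the batch size changes, rescale the learning rate so that η/S stays fixed. A tiny helper makes it explicit (the starting values are hypothetical):

```python
def rescale_learning_rate(old_lr: float, old_batch_size: int, new_batch_size: int) -> float:
    """Keep eta / S constant by scaling the learning rate linearly with the batch size."""
    return old_lr * new_batch_size / old_batch_size

# Hypothetical example: eta = 0.1 at batch size 128, moving to batch size 512.
print(rescale_learning_rate(0.1, 128, 512))   # 0.4, so eta/S stays at 0.1/128
```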
Further Readings
• An Introduction to Stochastic Differential Equations
• http://ft-sipil.unila.ac.id/dbooks/AN%20INTRODUCTION%20TO%20STOCHASTIC%20DIFFERENTIAL%20EQUATIONS%20VERSION%201.2.pdf
• Stochastic Modified Equations and Adaptive Stochastic Gradient
Algorithms
• https://arxiv.org/abs/1511.06251
• Three Factors Influencing Minima in SGD
• https://arxiv.org/abs/1710.11029
43