Zap Q-Learning
Reinforcement Learning: Hidden Theory, and New Super-Fast Algorithms
Center for Systems and Control (CSC@USC)
and Ming Hsieh Institute for Electrical Engineering
February 21, 2018
Adithya M. Devraj Sean P. Meyn
Department of Electrical and Computer Engineering — University of Florida
Outline
1 Stochastic Approximation
2 Fastest Stochastic Approximation
3 Reinforcement Learning
4 Zap Q-Learning
5 Conclusions & Future Work
6 References
Stochastic Approximation

¯f(θ*) := E[f(θ, W)]|θ=θ* = 0
Stochastic Approximation: Basic Algorithm
What is Stochastic Approximation?

A simple goal: find the solution θ* to
    ¯f(θ*) := E[f(θ, W)]|θ=θ* = 0

What makes this hard?
1. The function f and the distribution of the random vector W may not be known; we may only know something about the structure of the problem.
2. Even if everything is known, computation of the expectation may be expensive. For root finding, we may need to compute the expectation for many values of θ.
3. This motivates stochastic approximation: θ(n + 1) = θ(n) + αn f(θ(n), W(n))

The recursive algorithms we come up with are often slow, and their variance may be infinite: typical in Q-learning [Devraj & M 2017].
Stochastic Approximation: ODE Method
Algorithm and Convergence Analysis

Algorithm:
    θ(n + 1) = θ(n) + αn f(θ(n), W(n))
Goal:
    ¯f(θ*) := E[f(θ, W)]|θ=θ* = 0
Interpretation: θ* ≡ stationary point of the ODE
    (d/dt) ϑ(t) = ¯f(ϑ(t))
Analysis: stability of the ODE ⊕ (see Borkar's monograph) ⟹ lim_{n→∞} θ(n) = θ*
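To make the recursion and its ODE interpretation concrete, here is a minimal Python sketch (not from the lecture; the linear model and all numerical values are illustrative) with f(θ, W) = A(θ − θ*) + W, so that ¯f(θ) = A(θ − θ*). Both the SA iterates and an Euler solution of the mean-flow ODE approach θ*.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative example: f(theta, W) = A (theta - theta_star) + W,
# so bar_f(theta) = A (theta - theta_star) and bar_f(theta_star) = 0.
A = np.array([[-1.0, 0.3],
              [0.0, -2.0]])
theta_star = np.array([1.0, -0.5])

def f(theta, w):
    """One noisy observation of bar_f(theta)."""
    return A @ (theta - theta_star) + w

# Stochastic approximation: theta(n+1) = theta(n) + alpha_n f(theta(n), W(n))
theta = np.zeros(2)
N = 50_000
for n in range(1, N + 1):
    alpha_n = 1.0 / n
    w = rng.normal(size=2)            # i.i.d. noise W(n)
    theta = theta + alpha_n * f(theta, w)

# Euler solution of the mean-flow ODE  d/dt vartheta = bar_f(vartheta),
# which the SA iterates track on the time scale t_n = sum_k alpha_k.
vartheta, dt = np.zeros(2), 1e-3
for _ in range(int(20 / dt)):
    vartheta = vartheta + dt * (A @ (vartheta - theta_star))

print("SA estimate :", np.round(theta, 3))
print("ODE limit   :", np.round(vartheta, 3))
print("theta*      :", theta_star)
```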
Stochastic Approximation: SA Example
Example: Monte Carlo Estimation

Estimate the mean η = E[c(X)], where X is a random variable:
    η = ∫ c(x) fX(x) dx

SA interpretation: find θ* solving 0 = E[f(θ, X)] = E[c(X) − θ]
Algorithm (the sample average):
    θ(n) = (1/n) ∑_{i=1}^{n} c(X(i))
    ⟹ (n + 1) θ(n + 1) = ∑_{i=1}^{n+1} c(X(i)) = n θ(n) + c(X(n + 1))
    ⟹ (n + 1) θ(n + 1) = (n + 1) θ(n) + [c(X(n + 1)) − θ(n)]
SA recursion: θ(n + 1) = θ(n) + αn f(θ(n), X(n + 1)), with αn = 1/(n + 1)
Step-size conditions: ∑ αn = ∞, ∑ αn² < ∞
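As a quick illustration (not from the lecture), the sketch below takes c(X) = X² with X ∼ N(0, 1), so η = E[X²] = 1, and checks that the SA recursion with αn = 1/(n + 1) reproduces the running sample average exactly.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=100_000)
c = X**2                      # c(X); here eta = E[X^2] = 1

# SA recursion: theta(n+1) = theta(n) + alpha_n [c(X(n+1)) - theta(n)], alpha_n = 1/(n+1)
theta = 0.0
for n, cx in enumerate(c):
    theta += (cx - theta) / (n + 1)

print(theta, c.mean())        # identical: the recursion *is* the sample average
```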
Fastest Stochastic Approximation
[Plot: sample paths of the estimates θ(k) versus iteration k]
Fastest Stochastic Approximation: Algorithm Performance
Performance Criteria

Two standard approaches to evaluate performance, with ˜θ(n) := θ(n) − θ*:
1. Finite-n bound:
    P{‖˜θ(n)‖ ≥ ε} ≤ exp(−I(ε, n)),   I(ε, n) = O(nε²)
2. Asymptotic covariance:
    Σ = lim_{n→∞} n E[˜θ(n) ˜θ(n)ᵀ],   √n ˜θ(n) ≈ N(0, Σ)
Fastest Stochastic Approximation: Algorithm Performance
Asymptotic Covariance

    Σ = lim_{n→∞} Σn = lim_{n→∞} n E[˜θ(n) ˜θ(n)ᵀ],   √n ˜θ(n) ≈ N(0, Σ)

SA recursion for the covariance:
    Σn+1 ≈ Σn + (1/n) [(A + ½I) Σn + Σn (A + ½I)ᵀ + Σ∆],   A = (d/dθ) ¯f(θ*)

Conclusions
1. If Re λ(A) ≥ −½ for some eigenvalue, then Σ is (typically) infinite.
2. If Re λ(A) < −½ for all eigenvalues, then Σ = lim_{n→∞} Σn is the unique solution to the Lyapunov equation:
    0 = (A + ½I) Σ + Σ (A + ½I)ᵀ + Σ∆
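The Lyapunov equation is easy to use numerically as a design tool. The sketch below (illustrative A and Σ∆, not from the lecture) solves 0 = (A + ½I)Σ + Σ(A + ½I)ᵀ + Σ∆ by vectorization and compares the result with the empirical value of n E[˜θ(n)˜θ(n)ᵀ] from simulated runs of the linearized recursion.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative linearized model: theta~(n+1) = theta~(n) + (1/n)[A theta~(n) + Delta(n+1)]
A = np.array([[-2.0, 0.5],
              [0.0, -1.5]])            # every eigenvalue satisfies Re(lambda) < -1/2
Sigma_Delta = np.array([[1.0, 0.2],
                        [0.2, 0.5]])   # covariance of the noise Delta

# Solve  0 = (A + I/2) Sigma + Sigma (A + I/2)^T + Sigma_Delta  by vectorization.
d = A.shape[0]
M = A + 0.5 * np.eye(d)
K = np.kron(M, np.eye(d)) + np.kron(np.eye(d), M)     # acts on row-major vec(Sigma)
Sigma = np.linalg.solve(K, -Sigma_Delta.reshape(-1)).reshape(d, d)

# Simulation check:  Sigma ≈ n E[theta~(n) theta~(n)^T]  for large n.
L = np.linalg.cholesky(Sigma_Delta)
trials, N = 2000, 50_000
th = np.zeros((trials, d))
for n in range(1, N + 1):
    Delta = rng.normal(size=(trials, d)) @ L.T
    th = th + (th @ A.T + Delta) / n
Sigma_hat = N * (th.T @ th) / trials

print("Lyapunov solution Sigma:\n", np.round(Sigma, 3))
print("Empirical n E[theta~ theta~^T]:\n", np.round(Sigma_hat, 3))
```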
Fastest Stochastic Approximation: Stochastic Newton-Raphson
Optimal Asymptotic Covariance

Introduce a d × d matrix gain sequence {Gn}:
    θ(n + 1) = θ(n) + (1/(n + 1)) Gn f(θ(n), X(n))
Assume it converges, and linearize:
    ˜θ(n + 1) ≈ ˜θ(n) + (1/(n + 1)) G [A ˜θ(n) + ∆(n + 1)],   A = (d/dθ) ¯f(θ*)

If G = G* := −A⁻¹ then
Resembles a Monte Carlo estimate
Resembles Newton-Raphson
It is optimal: Σ* = G* Σ∆ (G*)ᵀ ≤ ΣG for any other G

Polyak-Ruppert averaging is also optimal, but the first two bullets are missing.
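A sketch comparing the scaled mean-square error n E[‖˜θ(n)‖²] under the plain gain Gn ≡ I and under the matrix gain G* = −A⁻¹, for an illustrative 2-dimensional linear model f(θ, X) = Aθ + X with θ* = 0. Nothing here comes from the lecture; it only illustrates the inequality Σ* ≤ ΣG. Because one eigenvalue of A is close to −1/2, the scalar-gain run approaches its (large) limit slowly.

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative linear root-finding problem: f(theta, X) = A theta + X, theta* = 0
A = np.array([[-0.6, 0.4],
              [0.0, -5.0]])
Sigma_Delta = np.eye(2)
G_star = -np.linalg.inv(A)

def run(G, N=200_000, trials=500):
    th = np.zeros((trials, 2))
    for n in range(1, N + 1):
        X = rng.normal(size=(trials, 2))
        th = th + (th @ A.T + X) @ G.T / n        # theta += (1/n) G f(theta, X)
    return N * np.mean(np.sum(th**2, axis=1))     # empirical n E[ ||theta~(n)||^2 ]

print("plain gain  G = I    :", round(run(np.eye(2)), 3))
print("matrix gain G = -A^-1:", round(run(G_star), 3))
print("theoretical optimum trace(G* Sigma_Delta G*^T):",
      round(np.trace(G_star @ Sigma_Delta @ G_star.T), 3))
```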
Fastest Stochastic Approximation: Stochastic Newton-Raphson
Optimal Variance

Example: return to Monte Carlo, now with a scalar gain g:
    θ(n + 1) = θ(n) + (g/(n + 1)) [−θ(n) + X(n + 1)]
Normalization for analysis, with ∆(n) = X(n) − E[X(n)]:
    ˜θ(n + 1) = ˜θ(n) + (g/(n + 1)) [−˜θ(n) + ∆(n + 1)]

Example: X(n) = W²(n), W ∼ N(0, 1)
Asymptotic variance as a function of the gain (finite only for g > 1/2):
    Σ = (σ∆²/2) g²/(g − 1/2)
The minimum Σ = σ∆² is attained at g = 1, which is the optimal gain G* = −A⁻¹ here, since A = −1.

[Plots: asymptotic variance as a function of g, and SA estimates of E[W²], W ∼ N(0, 1), for gains g ∈ {0.1, 0.5, 1, 10, 20}]
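A sketch reproducing this scalar example numerically (seed and trial counts are arbitrary): estimate E[W²] for W ∼ N(0, 1) with gain g and compare the empirical n·Var(θ(n)) against the formula Σ = (σ∆²/2) g²/(g − 1/2), with σ∆² = Var(W²) = 2. Gains g ≤ 1/2 are omitted since the limit is not finite there.

```python
import numpy as np

rng = np.random.default_rng(4)
sigma2_Delta = 2.0                       # Var(W^2) for W ~ N(0, 1)

def n_var(g, N=50_000, trials=2000):
    """Empirical n * Var(theta(N)) for
       theta(n+1) = theta(n) + (g/(n+1)) (-theta(n) + W(n+1)^2)."""
    theta = np.zeros(trials)
    for n in range(N):
        x = rng.normal(size=trials) ** 2
        theta += g * (x - theta) / (n + 1)
    return N * np.var(theta)

for g in [1.0, 2.0, 5.0, 10.0]:
    formula = 0.5 * sigma2_Delta * g**2 / (g - 0.5)
    print(f"g = {g:4.1f}:  empirical ≈ {n_var(g):6.2f},  formula = {formula:6.2f}")
```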
Fastest Stochastic Approximation: Stochastic Newton-Raphson
Optimal Asymptotic Covariance and Zap-SNR

Zap-SNR (designed to emulate deterministic Newton-Raphson) requires Ân ≈ A(θn) := (d/dθ) ¯f(θn):
    θ(n + 1) = θ(n) + αn (−Ân)⁻¹ f(θ(n), X(n))
    Ân = Ân−1 + γn (An − Ân−1),   An := (d/dθ) f(θ(n), X(n))

Ân ≈ A(θn) requires a high-gain estimate: γn/αn → ∞ as n → ∞.
Always: αn = 1/n. In the numerics that follow: γn = (1/n)^ρ, ρ ∈ (0.5, 1).

ODE for Zap-SNR:
    (d/dt) xt = [−A(xt)]⁻¹ ¯f(xt),   A(x) = (d/dx) ¯f(x)
Not necessarily stable (just like deterministic Newton-Raphson);
general conditions for convergence are open.
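A minimal sketch of Zap-SNR (not from the lecture) for an illustrative linear model f(θ, W) = b(W) − H(W)θ with E[H(W)] = H̄ invertible, so that θ* = H̄⁻¹b̄ and A(θ) ≡ −H̄. The step sizes follow the slide (αn = 1/n, γn with ρ = 0.85); initializing Ân at −I and evaluating γ at n + 1 is a small safeguard of mine to keep the early matrix estimates well conditioned, not part of the algorithm as stated.

```python
import numpy as np

rng = np.random.default_rng(5)

# Illustrative linear model: f(theta, W) = b(W) - H(W) theta,
# bar_f(theta) = b_bar - H_bar theta, theta* = H_bar^{-1} b_bar, A(theta) = -H_bar.
H_bar = np.array([[1.0, 0.2],
                  [0.2, 0.5]])
b_bar = np.array([1.0, -1.0])
theta_star = np.linalg.solve(H_bar, b_bar)

d, rho, N = 2, 0.85, 100_000
theta = np.zeros(d)
A_hat = -np.eye(d)                      # running estimate of A(theta_n); -I is a safe start
for n in range(1, N + 1):
    H = H_bar + 0.3 * rng.normal(size=(d, d))   # noisy observation of H_bar
    b = b_bar + rng.normal(size=d)              # noisy observation of b_bar
    f_n = b - H @ theta                         # f(theta(n), X(n))
    A_n = -H                                    # d/dtheta f(theta(n), X(n))

    gamma_n = (1.0 / (n + 1)) ** rho            # high-gain (faster) time scale
    alpha_n = 1.0 / n                           # slower time scale for theta
    A_hat = A_hat + gamma_n * (A_n - A_hat)
    theta = theta + alpha_n * np.linalg.solve(-A_hat, f_n)

print("Zap-SNR estimate:", np.round(theta, 4))
print("theta*          :", np.round(theta_star, 4))
```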
Reinforcement Learning
and Stochastic Approximation
Reinforcement Learning RL & SA
SA and RL Design
Functional equations in Stochastic Control
Always of the form
0 = E[F(h∗
, Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗
= ?
9 / 31
Reinforcement Learning RL & SA
SA and RL Design
Functional equations in Stochastic Control
Always of the form
0 = E[F(h∗
, Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗
= ?
Φ(n) = (state, action)
9 / 31
Reinforcement Learning RL & SA
SA and RL Design
Functional equations in Stochastic Control
Always of the form
0 = E[F(h∗
, Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗
= ?
Galerkin relaxation:
0 = E[F(hθ∗
, Φ(n + 1))ζn] , θ∗
= ?
9 / 31
Reinforcement Learning RL & SA
SA and RL Design
Functional equations in Stochastic Control
Always of the form
0 = E[F(h∗
, Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗
= ?
Galerkin relaxation:
0 = E[F(hθ∗
, Φ(n + 1))ζn] , θ∗
= ?
Necessary Ingredients:
Parameterized family {hθ : θ ∈ Rd}
Adapted, d-dimensional stochastic process {ζn}
Examples are TD- and Q-Learning
9 / 31
Reinforcement Learning RL & SA
SA and RL Design
Functional equations in Stochastic Control
Always of the form
0 = E[F(h∗
, Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗
= ?
Galerkin relaxation:
0 = E[F(hθ∗
, Φ(n + 1))ζn] , θ∗
= ?
Necessary Ingredients:
Parameterized family {hθ : θ ∈ Rd}
Adapted, d-dimensional stochastic process {ζn}
Examples are TD- and Q-Learning
These algorithms are thus special cases of stochastic approximation
(as we all know)
9 / 31
Reinforcement Learning: MDP Theory
Stochastic Optimal Control

MDP Model
X is a stationary controlled Markov chain, with input U. For all states x and sets A,
    P{X(n + 1) ∈ A | X(n) = x, U(n) = u, and prior history} = Pu(x, A)
c: X × U → R is a cost function
β < 1 is a discount factor

Value function:
    h*(x) = min_U ∑_{n=0}^{∞} βⁿ E[c(X(n), U(n)) | X(0) = x]

Bellman equation:
    h*(x) = min_u {c(x, u) + β E[h*(X(n + 1)) | X(n) = x, U(n) = u]}
Reinforcement Learning: Q-Learning
Q-function: a trick to swap expectation and minimum

Bellman equation:
    h*(x) = min_u {c(x, u) + β E[h*(X(n + 1)) | X(n) = x, U(n) = u]}

Q-function:
    Q*(x, u) := c(x, u) + β E[h*(X(n + 1)) | X(n) = x, U(n) = u],   so that   h*(x) = min_u Q*(x, u)

Another Bellman equation, entirely in terms of Q*:
    Q*(x, u) = c(x, u) + β E[Q*(X(n + 1)) | X(n) = x, U(n) = u],   where   Q*(x) := min_u Q*(x, u)
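The fixed-point characterization of Q* is easy to compute exactly for a small model. Below is a sketch of Q-value iteration for a small randomly generated MDP (illustrative, not from the lecture), iterating Q(x, u) ← c(x, u) + β ∑_{x'} Pu(x, x') min_{u'} Q(x', u') to convergence.

```python
import numpy as np

rng = np.random.default_rng(6)

# Illustrative finite MDP: nX states, nU actions
nX, nU, beta = 6, 2, 0.9
P = rng.random((nU, nX, nX)); P /= P.sum(axis=2, keepdims=True)   # P[u, x, x']
c = rng.random((nX, nU))                                          # cost c(x, u)

# Q-value iteration:  Q(x,u) <- c(x,u) + beta * sum_x' P_u(x,x') min_u' Q(x',u')
Q = np.zeros((nX, nU))
for _ in range(5000):
    h = Q.min(axis=1)                       # h(x) = min_u Q(x, u)
    Q_new = c + beta * np.einsum('uxy,y->xu', P, h)
    if np.max(np.abs(Q_new - Q)) < 1e-10:
        Q = Q_new
        break
    Q = Q_new

h_star = Q.min(axis=1)                      # value function h*(x) = min_u Q*(x, u)
policy = Q.argmin(axis=1)                   # minimizer phi*(x)
print("h*:", np.round(h_star, 3))
print("optimal policy:", policy)
```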
Reinforcement Learning: Q-Learning
Q-Learning and Galerkin Relaxation

Dynamic programming: find the function Q* that solves
    E[c(X(n), U(n)) + β Q*(X(n + 1)) − Q*(X(n), U(n)) | Fn] = 0
That is,
    0 = E[F(Q*, Φ(n + 1)) | Φ(0), ..., Φ(n)],   with Φ(n + 1) = (X(n + 1), X(n), U(n)).

Q-Learning: find θ* that solves
    E[{c(X(n), U(n)) + β Qθ*(X(n + 1)) − Qθ*(X(n), U(n))} ζn] = 0
The family {Qθ} and eligibility vectors {ζn} are part of the algorithm design.
Reinforcement Learning: Q-Learning
Watkins' Q-learning. Big question: can we Zap Q-Learning?

Find θ* that solves
    E[{c(X(n), U(n)) + β Qθ*(X(n + 1)) − Qθ*(X(n), U(n))} ζn] = 0

Watkins' algorithm is stochastic approximation, with this design for the family {Qθ} and the eligibility vectors {ζn}:
Linearly parameterized family of functions: Qθ(x, u) = θᵀ ψ(x, u)
ζn ≡ ψ(Xn, Un), with indicator basis functions ψi(x, u) = 1{x = xi, u = ui} (a complete basis)

Asymptotic covariance is typically infinite.
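A sketch of tabular Watkins' Q-learning in exactly this form: Qθ(x, u) = θ(x, u), ζn the indicator of (X(n), U(n)), and step size αn = 1/n, run on a small randomly generated MDP under a randomized exploration policy. The MDP and all parameters are illustrative; this is the setting in which the slides note the asymptotic covariance is typically infinite.

```python
import numpy as np

rng = np.random.default_rng(7)

nX, nU, beta = 6, 2, 0.9
P = rng.random((nU, nX, nX)); P /= P.sum(axis=2, keepdims=True)   # P[u, x, x']
c = rng.random((nX, nU))

# Tabular Watkins Q-learning: Q^theta(x,u) = theta(x,u) (complete basis);
# only the visited entry is updated (zeta_n = indicator of (X(n), U(n))), alpha_n = 1/n.
Q = np.zeros((nX, nU))
x = 0
N = 200_000
for n in range(1, N + 1):
    u = rng.integers(nU)                             # randomized exploration policy
    x_next = rng.choice(nX, p=P[u, x])
    td = c[x, u] + beta * Q[x_next].min() - Q[x, u]  # temporal difference
    Q[x, u] += td / n                                # alpha_n = 1/n
    x = x_next

print("Watkins estimate of min_u Q(x,u):", np.round(Q.min(axis=1), 3))
```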
Zap Q-Learning
[Headline plot: Bellman error versus n; Watkins, Speedy Q-learning, and Polyak-Ruppert averaging compared with Zap]

Asymptotic Covariance of Watkins' Q-Learning: improvements are needed!
[Six-node stochastic-shortest-path example from Devraj & M 2017: histogram of the parameter estimate θn(15) after n = 10^6 iterations, compared with θ*(15) (≈ 486.6)]
Zap Q-Learning: Watkins' algorithm
Zap Q-Learning ≡ Zap-SNR for Q-Learning

    0 = ¯f(θ) = E[f(θ, W(n))] := E[ζn {c(X(n), U(n)) + β Qθ(X(n + 1)) − Qθ(X(n), U(n))}]

A(θ) = (d/dθ) ¯f(θ); at points of differentiability:
    A(θ) = E[ζn {β ψ(X(n + 1), φθ(X(n + 1))) − ψ(X(n), U(n))}ᵀ],
    φθ(X(n + 1)) := arg min_u Qθ(X(n + 1), u)

Algorithm:
    θ(n + 1) = θ(n) + αn (−Ân)⁻¹ f(θ(n), Φ(n));   Ân = Ân−1 + γn (An − Ân−1)
    An+1 := (d/dθ) f(θn, Φ(n)) = ζn {β ψ(X(n + 1), φθn(X(n + 1))) − ψ(X(n), U(n))}ᵀ
ODE Analysis: change of variables q = Q*(ς)
The functional Q* maps cost functions to Q-functions:
    q(x, u) = ς(x, u) + β ∑_{x'} Pu(x, x') min_{u'} q(x', u')

ODE for Zap-Q:
    qt = Q*(ςt),   (d/dt) ςt = −ςt + c
    ⟹ convergence, optimal covariance, ...
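A sketch of the Zap Q-learning recursion in the tabular (complete-basis) setting, for a small randomly generated MDP. Everything here is illustrative rather than a reproduction of the lecture's experiments; in particular, Ân is simply initialized at −I, whereas a practical implementation would guard against an ill-conditioned matrix estimate.

```python
import numpy as np

rng = np.random.default_rng(8)

nX, nU, beta = 6, 2, 0.99
P = rng.random((nU, nX, nX)); P /= P.sum(axis=2, keepdims=True)
c = rng.random((nX, nU))
d = nX * nU                                   # complete basis: one entry per (x, u)

def psi(x, u):
    e = np.zeros(d); e[x * nU + u] = 1.0
    return e

theta = np.zeros(d)                           # Q^theta(x,u) = theta^T psi(x,u)
A_hat = -np.eye(d)                            # matrix-gain estimate, safe start at -I
rho, N = 0.85, 200_000
x = 0
for n in range(1, N + 1):
    u = rng.integers(nU)                      # randomized exploration policy
    x_next = rng.choice(nX, p=P[u, x])

    Q = theta.reshape(nX, nU)
    phi = Q[x_next].argmin()                  # phi^theta(x') = argmin_u Q^theta(x', u)
    zeta = psi(x, u)
    f_n = zeta * (c[x, u] + beta * Q[x_next, phi] - Q[x, u])
    A_n = np.outer(zeta, beta * psi(x_next, phi) - zeta)   # d/dtheta f(theta_n, Phi(n))

    gamma_n = (1.0 / (n + 1)) ** rho          # matrix gain on the faster time scale
    A_hat = A_hat + gamma_n * (A_n - A_hat)
    theta = theta + np.linalg.solve(-A_hat, f_n) / n       # alpha_n = 1/n

    x = x_next

print("Zap-Q estimate of min_u Q(x,u):", np.round(theta.reshape(nX, nU).min(axis=1), 3))
```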
Zap Q-Learning: Watkins' algorithm
Example: Stochastic Shortest Path (six-node graph)

Convergence with Zap gain γn = n^(−0.85)
Watkins' algorithm has infinite asymptotic covariance with αn = 1/n
The optimal scalar gain is approximately αn = 1500/n

[Plot: Bellman error versus n, discount factor β = 0.99; Watkins with g = 1500, Speedy Q-learning, Polyak-Ruppert averaging, Zap with γn = αn, and Zap with γn = n^(−0.85)]
Zap Q-Learning: Watkins' algorithm
Optimize Walk to Cafe

Convergence with Zap gain γn = n^(−0.85); discount factor β = 0.99
[Histograms of Wn = √n ˜θn, entries #10 and #18, at n = 10^4 and n = 10^6, over 1000 trials: theoretical pdf versus empirical pdf]
The CLT gives a good prediction of finite-n performance.
Zap Q-Learning: Watkins' algorithm
Optimize Walk to Cafe
Local convergence: θ(0) initialized in a neighborhood of θ*

[Plots: Bellman error and 2σ confidence intervals versus n, up to n = 10^6, for Watkins with scalar gains g = 500, 1500, 5000, Speedy Q-learning, Polyak-Ruppert averaging, and Zap-Q with γn ≡ αn and γn ≡ αn^0.85; histograms of the Bellman error at n = 10^6]
Zap Q-Learning: Optimal Stopping
Model of Tsitsiklis and Van Roy: optimal stopping time in finance

State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10.
Every eigenvalue λ of A is real and satisfies λ > −1/2, so the asymptotic covariance is infinite.
The authors observed slow convergence and proposed a matrix gain sequence {Gn} (see refs for details).

[Plots: eigenvalues λi(A), and eigenvalues λ(GA) in the complex plane, for the finance example]
The favorite choice of gain in [23] barely meets the criterion Re(λ(GA)) < −1/2.
Zap Q-Learning: Optimal Stopping
Model of Tsitsiklis and Van Roy: optimal stopping time in finance
State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10.

[Histograms of Wn = √n ˜θn, entries #1 and #7, at n = 2 × 10^6, over 1000 trials, for Zap-Q and G-Q: theoretical pdf versus empirical pdf]
Zap Q-Learning: Optimal Stopping
Model of Tsitsiklis and Van Roy: optimal stopping time in finance
State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10.

Histograms of the average reward obtained using the different algorithms:
[Histograms at n = 2 × 10^4, 2 × 10^5, and 2 × 10^6 for G-Q(0) with g = 100 and g = 200, and Zap-Q with ρ = 0.8, 0.85, and 1.0]
Conclusions & Future Work
Conclusions & Future Work
Conclusions
Reinforcement Learning is not just cursed by dimension,
but also by variance
We need better design tools to improve performance
24 / 31
Conclusions & Future Work
Conclusions & Future Work
Conclusions
Reinforcement Learning is not just cursed by dimension,
but also by variance
We need better design tools to improve performance
The asymptotic covariance is an awesome design tool.
It is also predictive of finite-n performance.
Example: g∗
= 1500 was chosen based on asymptotic covariance
24 / 31
Conclusions & Future Work
Conclusions & Future Work
Conclusions
Reinforcement Learning is not just cursed by dimension,
but also by variance
We need better design tools to improve performance
The asymptotic covariance is an awesome design tool.
It is also predictive of finite-n performance.
Example: g∗
= 1500 was chosen based on asymptotic covariance
Future work:
Q-learning with function-approximation
Obtain conditions for a stable algorithm in a general setting
Optimal stopping time problems
Adaptive optimization of algorithm parameters
24 / 31
Conclusions & Future Work
Conclusions & Future Work
Conclusions
Reinforcement Learning is not just cursed by dimension,
but also by variance
We need better design tools to improve performance
The asymptotic covariance is an awesome design tool.
It is also predictive of finite-n performance.
Example: g∗
= 1500 was chosen based on asymptotic covariance
Future work:
Q-learning with function-approximation
Obtain conditions for a stable algorithm in a general setting
Optimal stopping time problems
Adaptive optimization of algorithm parameters
Finite-time analysis
24 / 31
Conclusions & Future Work
Thank you!
thankful
25 / 31
References

Background monographs: Control Techniques for Complex Networks [6] and Markov Chains and Stochastic Stability [5].
This lecture
A. M. Devraj and S. P. Meyn. Zap Q-learning. Advances in Neural Information Processing Systems (NIPS), Dec. 2017.
A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv, July 2017.
Selected References I
[1] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv, July 2017.
[2] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic
approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag,
Berlin, 1990. Translated from the French by Stephen S. Wilson.
[3] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan
Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK,
2008.
[4] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic
approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.
[5] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge
University Press, Cambridge, second edition, 2009. Published in the Cambridge
Mathematical Library.
[6] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
See last chapter on simulation and average-cost TD learning
Selected References II
[7] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure.
The Annals of Statistics, 13(1):236–245, 1985.
[8] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes.
Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research
and Industrial Engineering, Ithaca, NY, 1988.
[9] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i
telemekhanika (in Russian). translated in Automat. Remote Control, 51 (1991), pages
98–107, 1990.
[10] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.
SIAM J. Control Optim., 30(4):838–855, 1992.
[11] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic
approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
[12] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation
algorithms for machine learning. In Advances in Neural Information Processing Systems
24, pages 451–459. Curran Associates, Inc., 2011.
Selected References III
[13] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[14] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College,
Cambridge, Cambridge, UK, 1989.
[15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[16] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn.,
3(1):9–44, 1988.
[17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function
approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
[18] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th
Internat. Conf. on Neural Info. Proc. Systems, pages 1064–1070. MIT Press, 1997.
[19] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In
Advances in Neural Information Processing Systems, 2011.
[20] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning
Research, 5(Dec):1–25, 2003.
Selected References IV
[21] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for
neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and
Approximate Dynamic Programming for Feedback Control. Wiley, 2011.
[22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space
theory, approximation algorithms, and an application to pricing high-dimensional financial
derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.
[23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and
efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and
Applications, 16(2):207–239, 2006.
[24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference
learning. Mach. Learn., 22(1-3):33–57, 1996.
[25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn.,
49(2-3):233–246, 2002.
[26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function
approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003.
[27] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In IEEE
Conference on Decision and Control, pages 3598–3605, Dec. 2009.
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night StandCall Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Hsr Layout ☎ 7737669865 🥵 Book Your One night Stand
 
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night StandCall Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Doddaballapur Road ☎ 7737669865 🥵 Book Your One night Stand
 
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Rabindra Nagar  (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Rabindra Nagar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men  🔝malwa🔝   Escorts Ser...
➥🔝 7737669865 🔝▻ malwa Call-girls in Women Seeking Men 🔝malwa🔝 Escorts Ser...
 
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night StandCall Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
Call Girls In Bellandur ☎ 7737669865 🥵 Book Your One night Stand
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Detecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning ApproachDetecting Credit Card Fraud: A Machine Learning Approach
Detecting Credit Card Fraud: A Machine Learning Approach
 
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
👉 Amritsar Call Girl 👉📞 6367187148 👉📞 Just📲 Call Ruhi Call Girl Phone No Amri...
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
Call Girls Jalahalli Just Call 👗 7737669865 👗 Top Class Call Girl Service Ban...
 
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
Vip Mumbai Call Girls Marol Naka Call On 9920725232 With Body to body massage...
 

Introducing Zap Q-Learning

  • 1. Zap Q-Learning Reinforcement Learning: Hidden Theory, and New Super-Fast Algorithms Center for Systems and Control (CSC@USC) and Ming Hsieh Institute for Electrical Engineering February 21, 2018 Adithya M. Devraj Sean P. Meyn Department of Electrical and Computer Engineering — University of Florida
  • 2. Zap Q-Learning Outline 1 Stochastic Approximation 2 Fastest Stochastic Approximation 3 Reinforcement Learning 4 Zap Q-Learning 5 Conclusions & Future Work 6 References
  • 4. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 1 / 31
  • 5. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 What makes this hard? 1 / 31
  • 6. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 What makes this hard? 1 The function f and the distribution of the random vector W may not be known – we may only know something about the structure of the problem 1 / 31
  • 7. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 What makes this hard? 1 The function f and the distribution of the random vector W may not be known – we may only know something about the structure of the problem 2 Even if everything is known, computation of the expectation may be expensive. For root finding, we may need to compute the expectation for many values of θ 1 / 31
  • 8. Stochastic Approximation Basic Algorithm What is Stochastic Approximation? A simple goal: Find the solution θ∗ to ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 What makes this hard? 1 The function f and the distribution of the random vector W may not be known – we may only know something about the structure of the problem 2 Even if everything is known, computation of the expectation may be expensive. For root finding, we may need to compute the expectation for many values of θ 3 Motivates stochastic approximation: θ(n + 1) = θ(n) + αnf(θ(n), W(n)) The recursive algorithms we come up with are often slow, and their variance may be infinite: typical in Q-learning [Devraj & M 2017] 1 / 31
  • 9. Stochastic Approximation ODE Method Algorithm and Convergence Analysis Algorithm: θ(n + 1) = θ(n) + αnf(θ(n), W(n)) Goal: ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 Interpretation: θ∗ ≡ stationary point of the ODE d dt ϑ(t) = ¯f(ϑ(t)) 2 / 31
  • 10. Stochastic Approximation ODE Method Algorithm and Convergence Analysis Algorithm: θ(n + 1) = θ(n) + αnf(θ(n), W(n)) Goal: ¯f(θ∗ ) := E[f(θ, W)] θ=θ∗ = 0 Interpretation: θ∗ ≡ stationary point of the ODE d dt ϑ(t) = ¯f(ϑ(t)) Analysis: Stability of the ODE ⊕ (See Borkar’s monograph) =⇒ lim n→∞ θ(n) = θ∗ 2 / 31
  • 11. Stochastic Approximation SA Example Stochastic Approximation Example Example: Monte-Carlo Monte-Carlo Estimation Estimate the mean η = c(X), where X is a random variable: η = c(x) fX(x) dx 3 / 31
  • 12. Stochastic Approximation SA Example Stochastic Approximation Example Example: Monte-Carlo Monte-Carlo Estimation Estimate the mean η = c(X), where X is a random variable SA interpretation: Find θ∗ solving 0 = E[f(θ, X)] = E[c(X) − θ] Algorithm: θ(n) = 1 n n i=1 c(X(i)) 3 / 31
  • 13. Stochastic Approximation SA Example Stochastic Approximation Example Example: Monte-Carlo Monte-Carlo Estimation Estimate the mean η = E[c(X)], where X is a random variable SA interpretation: Find θ∗ solving 0 = E[f(θ, X)] = E[c(X) − θ] Algorithm: θ(n) = (1/n) Σ_{i=1}^n c(X(i)) =⇒ (n + 1)θ(n + 1) = Σ_{i=1}^{n+1} c(X(i)) = nθ(n) + c(X(n + 1)) =⇒ (n + 1)θ(n + 1) = (n + 1)θ(n) + [c(X(n + 1)) − θ(n)] SA Recursion: θ(n + 1) = θ(n) + αnf(θ(n), X(n + 1)), with Σ αn = ∞, Σ α²n < ∞ 3 / 31
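As a quick illustration, here is a minimal sketch of this recursion in Python (the function and variable names are illustrative, not from the slides); with αn = 1/(n+1) it reproduces the running average exactly:

```python
import numpy as np

def monte_carlo_sa(sample_x, c, n_iter=10_000, theta0=0.0, rng=None):
    """SA recursion theta(n+1) = theta(n) + alpha_n * (c(X(n+1)) - theta(n))."""
    rng = np.random.default_rng() if rng is None else rng
    theta = theta0
    for n in range(n_iter):
        alpha = 1.0 / (n + 1)      # satisfies sum(alpha_n) = inf, sum(alpha_n^2) < inf
        theta += alpha * (c(sample_x(rng)) - theta)
    return theta

# Example: estimate eta = E[W^2] = 1 for W ~ N(0, 1)
print(monte_carlo_sa(lambda rng: rng.standard_normal(), lambda w: w ** 2))
```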
  • 15. Fastest Stochastic Approximation Algorithm Performance Performance Criteria Two standard approaches to evaluate performance, ˜θ(n) := θ(n) − θ∗: 1 Finite-n bound: P{‖˜θ(n)‖ ≥ ε} ≤ exp(−I(ε, n)), I(ε, n) = O(nε²) 2 Asymptotic covariance: Σ = lim_{n→∞} n E[˜θ(n)˜θ(n)T], √n ˜θ(n) ≈ N(0, Σ) 4 / 31
  • 16. Fastest Stochastic Approximation Algorithm Performance Asymptotic Covariance Σ = lim_{n→∞} Σn = lim_{n→∞} n E[˜θ(n)˜θ(n)T], √n ˜θ(n) ≈ N(0, Σ) SA recursion for covariance: Σ_{n+1} ≈ Σn + (1/n)[(A + ½I)Σn + Σn(A + ½I)T + Σ∆], A = (d/dθ) ¯f(θ∗) Conclusions 1 If Re λ(A) ≥ −1/2 for some eigenvalue, then Σ is (typically) infinite 2 If Re λ(A) < −1/2 for all eigenvalues, then Σ = lim_{n→∞} Σn is the unique solution to the Lyapunov equation: 0 = (A + ½I)Σ + Σ(A + ½I)T + Σ∆ 5 / 31
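A hedged numerical companion to this slide: assuming a made-up linearization A with Re λ(A) < −1/2 and a made-up noise covariance Σ∆, the limit Σ can be computed by solving the Lyapunov equation directly (scipy's solve_continuous_lyapunov solves M X + X Mᵀ = Q):

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

# Hypothetical linearization A = (d/dθ) f̄(θ*) and noise covariance Σ_Δ (placeholders)
A = np.array([[-2.0, 0.3],
              [0.0, -1.5]])
Sigma_Delta = np.array([[1.0, 0.2],
                        [0.2, 0.5]])

# Σ is finite only if Re λ(A) < -1/2 for every eigenvalue
assert np.all(np.linalg.eigvals(A).real < -0.5)

# Solve 0 = (A + I/2) Σ + Σ (A + I/2)^T + Σ_Δ, i.e. M Σ + Σ M^T = -Σ_Δ with M = A + I/2
M = A + 0.5 * np.eye(2)
Sigma = solve_continuous_lyapunov(M, -Sigma_Delta)
print(Sigma)
```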
  • 17. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance Introduce a d × d matrix gain sequence {Gn}: θ(n + 1) = θ(n) + 1 n + 1 Gnf(θ(n), X(n)) 6 / 31
  • 18. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance Introduce a d × d matrix gain sequence {Gn}: θ(n + 1) = θ(n) + 1 n + 1 Gnf(θ(n), X(n)) Assume it converges, and linearize: ˜θ(n + 1) ≈ ˜θ(n) + 1 n + 1 G A˜θ(n) + ∆(n + 1) , A = d dθ ¯f (θ∗ ) . 6 / 31
  • 19. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance Introduce a d × d matrix gain sequence {Gn}: θ(n + 1) = θ(n) + (1/(n + 1)) Gn f(θ(n), X(n)) Assume it converges, and linearize: ˜θ(n + 1) ≈ ˜θ(n) + (1/(n + 1)) G[A˜θ(n) + ∆(n + 1)], A = (d/dθ) ¯f(θ∗). If G = G∗ := −A−1 then Resembles Monte-Carlo estimate Resembles Newton-Raphson It is optimal: Σ∗ = G∗Σ∆G∗T ≤ ΣG for any other G 6 / 31
  • 20. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance Introduce a d × d matrix gain sequence {Gn}: θ(n + 1) = θ(n) + (1/(n + 1)) Gn f(θ(n), X(n)) Assume it converges, and linearize: ˜θ(n + 1) ≈ ˜θ(n) + (1/(n + 1)) G[A˜θ(n) + ∆(n + 1)], A = (d/dθ) ¯f(θ∗). If G = G∗ := −A−1 then Resembles Monte-Carlo estimate Resembles Newton-Raphson It is optimal: Σ∗ = G∗Σ∆G∗T ≤ ΣG for any other G Polyak-Ruppert averaging is also optimal, but the first two properties above are missing. 6 / 31
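The step behind the optimality claim, written out (a sketch using the same Lyapunov-equation formalism as two slides back; with gain G the linearized recursion simply has GA and GΣ∆Gᵀ in place of A and Σ∆):

```latex
0 = \bigl(GA + \tfrac{1}{2}I\bigr)\Sigma_G + \Sigma_G\bigl(GA + \tfrac{1}{2}I\bigr)^{\mathsf T} + G\Sigma_\Delta G^{\mathsf T}.
\qquad
\text{With } G = G^* := -A^{-1}:\quad GA = -I
\;\Longrightarrow\;
0 = -\Sigma_{G^*} + G^*\Sigma_\Delta G^{*\mathsf T}
\;\Longrightarrow\;
\Sigma^* = G^*\Sigma_\Delta G^{*\mathsf T} = A^{-1}\Sigma_\Delta A^{-\mathsf T}.
```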
  • 21. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance Example: return to Monte-Carlo θ(n + 1) = θ(n) + g n + 1 −θ(n) + X(n + 1) 7 / 31
  • 22. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance Example: return to Monte-Carlo θ(n + 1) = θ(n) + g n + 1 −θ(n) + X(n + 1) ∆(n) = X(n) − E[X(n)] 7 / 31
  • 23. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance ∆(n) = X(n) − E[X(n)]Normalization for analysis: ˜θ(n + 1) = ˜θ(n) + g n + 1 −˜θ(n) + ∆(n + 1) 7 / 31
  • 24. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance ∆(n) = X(n) − E[X(n)]Normalization for analysis: ˜θ(n + 1) = ˜θ(n) + g n + 1 −˜θ(n) + ∆(n + 1) Example: X(n) = W2(n), W ∼ N(0, 1) 7 / 31
  • 25. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance ∆(n) = X(n) − E[X(n)]. Normalization for analysis: ˜θ(n + 1) = ˜θ(n) + (g/(n + 1)) [−˜θ(n) + ∆(n + 1)] Example: X(n) = W²(n), W ∼ N(0, 1) [Plot: asymptotic variance as a function of g, Σ = σ²∆ g²/(2g − 1)] 7 / 31
  • 26. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Variance ∆(n) = X(n) − E[X(n)]. Normalization for analysis: ˜θ(n + 1) = ˜θ(n) + (g/(n + 1)) [−˜θ(n) + ∆(n + 1)] Example: X(n) = W²(n), W ∼ N(0, 1) [Plot: SA estimates of E[W²], W ∼ N(0, 1), for several gains g] 7 / 31
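A small simulation sketch of this example (variable names are mine): it estimates E[W²] = 1 with step size g/(n+1) and compares n·Var(θ(n)) against the formula Σ = σ²∆ g²/(2g − 1), which requires g > 1/2; the match is rough for g close to 1/2 at moderate n.

```python
import numpy as np

def sa_estimate(g, n_iter, rng):
    """One SA run estimating E[W^2] with step size g/(n+1)."""
    theta = 0.0
    for n in range(n_iter):
        w = rng.standard_normal()
        theta += g / (n + 1) * (w ** 2 - theta)
    return theta

rng = np.random.default_rng(0)
n_iter, trials = 5_000, 400
sigma2_Delta = 2.0                                  # Var(W^2) = 2 for W ~ N(0, 1)
for g in (0.6, 1.0, 3.0):
    runs = np.array([sa_estimate(g, n_iter, rng) for _ in range(trials)])
    print(f"g = {g}:  n*Var ≈ {n_iter * runs.var():6.2f}   "
          f"formula {sigma2_Delta * g**2 / (2 * g - 1):6.2f}")
```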
  • 27. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) Requires An ≈ A(θn) := d dθ ¯f (θn) 8 / 31
  • 28. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−An)−1 f(θ(n), X(n)) An = An−1 + γn(An − An−1), An = d dθ f(θ(n), X(n)) 8 / 31
  • 29. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−An)−1 f(θ(n), X(n)) An = An−1 + γn(An − An−1), An = d dθ f(θ(n), X(n)) An ≈ A(θn) requires high-gain, γn αn → ∞, n → ∞ 8 / 31
  • 30. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−An)−1 f(θ(n), X(n)) An = An−1 + γn(An − An−1), An = d dθ f(θ(n), X(n)) An ≈ A(θn) requires high-gain, γn αn → ∞, n → ∞ Always: αn = 1/n. Numerics that follow: γn = (1/n)ρ, ρ ∈ (0.5, 1) 8 / 31
  • 31. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−An)−1 f(θ(n), X(n)) An = An−1 + γn(An − An−1), An = d dθ f(θ(n), X(n)) An ≈ A(θn) requires high-gain, γn αn → ∞, n → ∞ Always: αn = 1/n. Numerics that follow: γn = (1/n)ρ, ρ ∈ (0.5, 1) ODE for Zap-SNR d dt xt = (−A(xt))−1 ¯f (xt), A(x) = d dx ¯f (x) 8 / 31
  • 32. Fastest Stochastic Approximation Stochastic Newton Raphson Optimal Asymptotic Covariance and Zap-SNR Zap-SNR (designed to emulate deterministic Newton-Raphson) θ(n + 1) = θ(n) + αn(−Ân)−1 f(θ(n), X(n)), Ân = Ân−1 + γn(An − Ân−1), An = (d/dθ) f(θ(n), X(n)) Ân ≈ A(θn) requires a high gain: γn/αn → ∞ as n → ∞ Always: αn = 1/n. Numerics that follow: γn = (1/n)ρ, ρ ∈ (0.5, 1) ODE for Zap-SNR: (d/dt) xt = (−A(xt))−1 ¯f(xt), A(x) = (d/dx) ¯f(x) Not necessarily stable (just like deterministic Newton-Raphson) General conditions for convergence are open 8 / 31
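A schematic implementation of the two-time-scale recursion on this slide, for the special case where the Jacobian d/dθ f(θ, X) can be evaluated directly; the toy linear problem at the bottom is hypothetical and only there to make the sketch runnable, and the eps term is an ad-hoc safeguard against a momentarily singular Jacobian estimate.

```python
import numpy as np

def zap_snr(f, jac_f, sample_x, theta0, n_iter=20_000, rho=0.85, eps=1e-6, rng=None):
    """Sketch of Zap-SNR: theta-update with matrix gain (-A_hat)^{-1}, A_hat on a faster time scale."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.array(theta0, dtype=float)
    d = theta.size
    A_hat = -np.eye(d)                                # arbitrary Hurwitz initialization
    for n in range(1, n_iter + 1):
        alpha, gamma = 1.0 / n, n ** (-rho)           # gamma_n / alpha_n -> infinity (high gain)
        x = sample_x(rng)
        A_hat += gamma * (jac_f(theta, x) - A_hat)    # running estimate of A(theta_n)
        # theta(n+1) = theta(n) + alpha_n (-A_hat)^{-1} f(theta(n), X(n)), lightly regularized
        theta += alpha * np.linalg.solve(-A_hat + eps * np.eye(d), f(theta, x))
    return theta

# Toy linear root-finding problem (hypothetical): f(theta, X) = A_bar @ theta + b + noise
A_bar = np.array([[-1.0, 0.2], [0.0, -0.5]])
b = np.array([1.0, 1.0])
print(zap_snr(f=lambda th, x: A_bar @ th + b + 0.5 * x,
              jac_f=lambda th, x: A_bar,
              sample_x=lambda rng: rng.standard_normal(2),
              theta0=np.zeros(2)))                    # root is -A_bar^{-1} b = [1.4, 2.0]
```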
  • 35. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? 9 / 31
  • 36. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? Φ(n) = (state, action) 9 / 31
  • 37. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? Galerkin relaxation: 0 = E[F(hθ∗ , Φ(n + 1))ζn] , θ∗ = ? 9 / 31
  • 38. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? Galerkin relaxation: 0 = E[F(hθ∗ , Φ(n + 1))ζn] , θ∗ = ? Necessary Ingredients: Parameterized family {hθ : θ ∈ Rd} Adapted, d-dimensional stochastic process {ζn} Examples are TD- and Q-Learning 9 / 31
  • 39. Reinforcement Learning RL & SA SA and RL Design Functional equations in Stochastic Control Always of the form 0 = E[F(h∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , h∗ = ? Galerkin relaxation: 0 = E[F(hθ∗ , Φ(n + 1))ζn] , θ∗ = ? Necessary Ingredients: Parameterized family {hθ : θ ∈ Rd} Adapted, d-dimensional stochastic process {ζn} Examples are TD- and Q-Learning These algorithms are thus special cases of stochastic approximation (as we all know) 9 / 31
  • 40. Reinforcement Learning MDP Theory Stochastic Optimal Control MDP Model X is a stationary controlled Markov chain, with input U For all states x and sets A, P{X(n + 1) ∈ A | X(n) = x, U(n) = u, and prior history} = Pu(x, A) c: X × U → R is a cost function β < 1 a discount factor 10 / 31
  • 41. Reinforcement Learning MDP Theory Stochastic Optimal Control MDP Model X is a stationary controlled Markov chain, with input U For all states x and sets A, P{X(n + 1) ∈ A | X(n) = x, U(n) = u, and prior history} = Pu(x, A) c: X × U → R is a cost function β < 1 a discount factor Value function: h∗ (x) = min U ∞ n=0 βn E[c(X(n), U(n)) | X(0) = x] 10 / 31
  • 42. Reinforcement Learning MDP Theory Stochastic Optimal Control MDP Model X is a stationary controlled Markov chain, with input U For all states x and sets A, P{X(n + 1) ∈ A | X(n) = x, U(n) = u, and prior history} = Pu(x, A) c: X × U → R is a cost function β < 1 a discount factor Value function: h∗ (x) = min U ∞ n=0 βn E[c(X(n), U(n)) | X(0) = x] Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} 10 / 31
  • 43. Reinforcement Learning Q-Learning Q-function Trick to swap expectation and minimum Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} 11 / 31
  • 44. Reinforcement Learning Q-Learning Q-function Trick to swap expectation and minimum Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} Q-function: Q∗ (x, u) := c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u] 11 / 31
  • 45. Reinforcement Learning Q-Learning Q-function Trick to swap expectation and minimum Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} Q-function: Q∗ (x, u) := c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u] h∗ (x) = min u Q∗ (x, u) 11 / 31
  • 46. Reinforcement Learning Q-Learning Q-function Trick to swap expectation and minimum Bellman equation: h∗ (x) = min u {c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u]} Q-function: Q∗ (x, u) := c(x, u) + βE[h∗ (X(n + 1)) | X(n) = x, U(n) = u] h∗ (x) = min u Q∗ (x, u) Another Bellman equation: Q∗ (x, u) = c(x, u) + βE[Q∗ (X(n + 1)) | X(n) = x, U(n) = u] Q∗ (x) = min u Q∗ (x, u) 11 / 31
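The second Bellman equation is a fixed-point equation that can be iterated directly when the model is known. A tiny tabular sketch follows (the 3-state, 2-action MDP here is made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
nX, nU, beta = 3, 2, 0.9

# Hypothetical MDP: transition kernels P[u, x, x'] and cost c[x, u]
P = rng.random((nU, nX, nX))
P /= P.sum(axis=2, keepdims=True)
c = rng.random((nX, nU))

Q = np.zeros((nX, nU))
for _ in range(1000):
    h = Q.min(axis=1)                              # h*(x) = min_u Q*(x, u)
    Q = c + beta * np.einsum('uxy,y->xu', P, h)    # Q*(x,u) = c(x,u) + beta E[h*(X') | x, u]
print(Q)
print("greedy policy:", Q.argmin(axis=1))
```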
  • 47. Reinforcement Learning Q-Learning Q-Learning and Galerkin Relaxation Dynamic programming Find function Q∗ that solves E c(X(n), U(n)) + βQ∗ (X(n + 1)) − Q∗ (X(n), U(n)) | Fn = 0 12 / 31
  • 48. Reinforcement Learning Q-Learning Q-Learning and Galerkin Relaxation Dynamic programming Find function Q∗ that solves E c(X(n), U(n)) + βQ∗ (X(n + 1)) − Q∗ (X(n), U(n)) | Fn = 0 That is, 0 = E[F(Q∗ , Φ(n + 1)) | Φ0 . . . Φ(n)] , with Φ(n + 1) = (X(n + 1), X(n), U(n)). 12 / 31
  • 49. Reinforcement Learning Q-Learning Q-Learning and Galerkin Relaxation Dynamic programming Find function Q∗ that solves E c(X(n), U(n)) + βQ∗ (X(n + 1)) − Q∗ (X(n), U(n)) | Fn = 0 Q-Learning Find θ∗ that solves E c(X(n), U(n)) + βQθ∗ ((X(n + 1)) − Qθ∗ ((X(n), U(n)) ζn = 0 The family {Qθ} and eligibility vectors {ζn} are part of algorithm design. 12 / 31
  • 50. Reinforcement Learning Q-Learning Watkins’ Q-learning Find θ∗ that solves E c(X(n), U(n)) + βQθ∗ ((X(n + 1)) − Qθ∗ ((X(n), U(n)) ζn = 0 13 / 31
  • 51. Reinforcement Learning Q-Learning Watkins’ Q-learning Find θ∗ that solves E c(X(n), U(n)) + βQθ∗ ((X(n + 1)) − Qθ∗ ((X(n), U(n)) ζn = 0 Watkin’s algorithm is Stochastic Approximation The family {Qθ} and eligibility vectors {ζn} in this design: Linearly parameterized family of functions: Qθ(x, u) = θT ψ(x, u) ζn ≡ ψ(Xn, Un) and ψn(x, u) = 1{x = xn, u = un} (complete basis) 13 / 31
  • 52. Reinforcement Learning Q-Learning Watkins’ Q-learning Find θ∗ that solves E c(X(n), U(n)) + βQθ∗ ((X(n + 1)) − Qθ∗ ((X(n), U(n)) ζn = 0 Watkin’s algorithm is Stochastic Approximation The family {Qθ} and eligibility vectors {ζn} in this design: Linearly parameterized family of functions: Qθ(x, u) = θT ψ(x, u) ζn ≡ ψ(Xn, Un) and ψn(x, u) = 1{x = xn, u = un} (complete basis) Asymptotic covariance is typically infinite 13 / 31
  • 53. Reinforcement Learning Q-Learning Watkins’ Q-learning Big Question: Can we Zap Q-Learning? Find θ∗ that solves E[{c(X(n), U(n)) + βQθ∗(X(n + 1)) − Qθ∗(X(n), U(n))} ζn] = 0 Watkins’ algorithm is Stochastic Approximation The family {Qθ} and eligibility vectors {ζn} in this design: Linearly parameterized family of functions: Qθ(x, u) = θT ψ(x, u) ζn ≡ ψ(Xn, Un) and ψn(x, u) = 1{x = xn, u = un} (complete basis) Asymptotic covariance is typically infinite 13 / 31
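For comparison with what follows, a sketch of the tabular (complete-basis) algorithm on the same kind of toy model as in the earlier sketch; the exploration policy is uniformly random and the step size is g divided by the visit count of each (x, u) pair (both choices are illustrative, not prescribed by the slides).

```python
import numpy as np

def watkins_q(P, c, beta, n_iter=200_000, g=1.0, rng=None):
    """Tabular Watkins Q-learning, Q_theta(x,u) = theta^T psi(x,u) with the complete basis."""
    rng = np.random.default_rng() if rng is None else rng
    nU, nX, _ = P.shape
    Q = np.zeros((nX, nU))
    visits = np.zeros((nX, nU))
    x = 0
    for _ in range(n_iter):
        u = rng.integers(nU)                             # uniform exploration
        x_next = rng.choice(nX, p=P[u, x])
        visits[x, u] += 1
        alpha = g / visits[x, u]                         # step size g / n(x, u)
        td = c[x, u] + beta * Q[x_next].min() - Q[x, u]  # temporal-difference term
        Q[x, u] += alpha * td
        x = x_next
    return Q

# Reusing P, c, beta from the previous sketch: watkins_q(P, c, beta) approximates Q*
```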
  • 54. Zap Q-Learning [Plot: Bellman error vs. n for Watkins, Speedy Q-learning, Polyak-Ruppert averaging, and Zap]
  • 55. Zap Q-Learning Asymptotic Covariance of Watkins’ Q-Learning Improvements are needed! [Histogram of the parameter estimate θn(15) vs. θ∗(15) after n = 10^6 iterations] Example from Devraj & M 2017 14 / 31
  • 56. Zap Q-Learning Watkin’s algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = ¯f(θ) = E f(θ, W(n)) := E ζn c(X(n), U(n)) + βQθ (X(n + 1)) − Qθ (X(n), U(n)) 15 / 31
  • 57. Zap Q-Learning Watkin’s algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = ¯f(θ) = E f(θ, W(n)) := E ζn c(X(n), U(n)) + βQθ (X(n + 1)) − Qθ (X(n), U(n)) A(θ) = d dθ ¯f (θ); 15 / 31
  • 58. Zap Q-Learning Watkin’s algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = ¯f(θ) = E f(θ, W(n)) := E ζn c(X(n), U(n)) + βQθ (X(n + 1)) − Qθ (X(n), U(n)) A(θ) = d dθ ¯f (θ); At points of differentiability: A(θ) = E ζn βψ(X(n + 1), φθ (X(n + 1))) − ψ(X(n), U(n)) T φθ (X(n + 1)) := arg min u Qθ (X(n + 1), u) 15 / 31
  • 59. Zap Q-Learning Watkin’s algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = ¯f(θ) = E f(θ, W(n)) := E ζn c(X(n), U(n)) + βQθ (X(n + 1)) − Qθ (X(n), U(n)) A(θ) = d dθ ¯f (θ); At points of differentiability: A(θ) = E ζn βψ(X(n + 1), φθ (X(n + 1))) − ψ(X(n), U(n)) T φθ (X(n + 1)) := arg min u Qθ (X(n + 1), u) Algorithm: θ(n + 1)= θ(n) + αn(−An)−1 (f(θ(n), Φ(n))); An = An−1 + γn(An − An−1); 15 / 31
  • 60. Zap Q-Learning Watkins’ algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning 0 = ¯f(θ) = E[f(θ, W(n))] := E[ζn{c(X(n), U(n)) + βQθ(X(n + 1)) − Qθ(X(n), U(n))}] A(θ) = (d/dθ) ¯f(θ); At points of differentiability: A(θ) = E[ζn{βψ(X(n + 1), φθ(X(n + 1))) − ψ(X(n), U(n))}T] φθ(X(n + 1)) := arg min_u Qθ(X(n + 1), u) Algorithm: θ(n + 1) = θ(n) + αn(−Ân)−1 f(θ(n), Φ(n)); Ân = Ân−1 + γn(An − Ân−1); An := (d/dθ) f(θn, Φ(n)) = ζn[βψ(X(n + 1), φθn(X(n + 1))) − ψ(X(n), U(n))]T 15 / 31
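Putting the last two sketches together: a tabular Zap-Q iteration in the same toy setting (a sketch of the recursion on this slide, not the authors' reference implementation; the eps ridge term is an ad-hoc safeguard so that −Ân is always invertible).

```python
import numpy as np

def zap_q(P, c, beta, n_iter=200_000, rho=0.85, eps=1e-6, rng=None):
    """Sketch of tabular Zap Q-learning: Watkins' TD term with matrix gain (-A_hat_n)^{-1}."""
    rng = np.random.default_rng() if rng is None else rng
    nU, nX, _ = P.shape
    d = nX * nU
    idx = lambda x, u: x * nU + u                  # index of basis vector psi(x, u)
    theta = np.zeros(d)                            # Q_theta(x, u) = theta[idx(x, u)]
    A_hat = -np.eye(d)
    x = 0
    for n in range(1, n_iter + 1):
        alpha, gamma = 1.0 / n, n ** (-rho)
        u = rng.integers(nU)
        x_next = rng.choice(nX, p=P[u, x])
        q_next = theta[x_next * nU:(x_next + 1) * nU]
        u_star = int(q_next.argmin())              # phi^theta(x') = argmin_u' Q_theta(x', u')
        # f(theta, Phi_n) = zeta_n [c + beta*Q_theta(x', u*) - Q_theta(x, u)], zeta_n = psi(x, u)
        f_n = np.zeros(d)
        f_n[idx(x, u)] = c[x, u] + beta * q_next[u_star] - theta[idx(x, u)]
        # A_n = zeta_n [beta*psi(x', u*) - psi(x, u)]^T
        A_n = np.zeros((d, d))
        A_n[idx(x, u), idx(x_next, u_star)] += beta
        A_n[idx(x, u), idx(x, u)] -= 1.0
        A_hat += gamma * (A_n - A_hat)
        theta += alpha * np.linalg.solve(-A_hat + eps * np.eye(d), f_n)
        x = x_next
    return theta.reshape(nX, nU)

# Reusing P, c, beta from the earlier sketch: zap_q(P, c, beta) also approximates Q*
```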
  • 61. Zap Q-Learning Watkin’s algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning ODE Analysis: change of variables q = Q∗(ς) Functional Q∗ maps cost functions to Q-functions: q(x, u) = ς(x, u) + β x Pu(x, x ) min u q(x , u ) 16 / 31
  • 62. Zap Q-Learning Watkins’ algorithm Zap Q-learning Zap Q-Learning ≡ Zap-SNR for Q-Learning ODE Analysis: change of variables q = Q∗(ς) Functional Q∗ maps cost functions to Q-functions: q(x, u) = ς(x, u) + β Σ_{x'} Pu(x, x') min_{u'} q(x', u') ODE for Zap-Q: qt = Q∗(ςt), (d/dt) ςt = −ςt + c ⇒ convergence, optimal covariance, ... 16 / 31
  • 63. Zap Q-Learning Watkins’ algorithm Zap Q-Learning Example: Stochastic Shortest Path [graph with nodes 1–6] 17 / 31
  • 64. Zap Q-Learning Watkins’ algorithm Zap Q-Learning Example: Stochastic Shortest Path Convergence with Zap gain γn = n−0.85 Watkins’ algorithm has infinite asymptotic covariance with αn = 1/n [Plot: Bellman error vs. n for Watkins, Speedy Q-learning, Polyak-Ruppert averaging, and Zap — Convergence of Zap-Q Learning] Discount factor: β = 0.99 17 / 31
  • 65. Zap Q-Learning Watkins’ algorithm Zap Q-Learning Example: Stochastic Shortest Path Convergence with Zap gain γn = n−0.85 Watkins’ algorithm has infinite asymptotic covariance with αn = 1/n [Plot: Bellman error vs. n for Watkins, Speedy Q-learning, Polyak-Ruppert averaging, Zap (γn = αn), and Zap — Convergence of Zap-Q Learning] Discount factor: β = 0.99 17 / 31
  • 66. Zap Q-Learning Watkins’ algorithm Zap Q-Learning Example: Stochastic Shortest Path Convergence with Zap gain γn = n−0.85 Watkins’ algorithm has infinite asymptotic covariance with αn = 1/n Optimal scalar gain is approximately αn = 1500/n [Plot: Bellman error vs. n for Watkins, Watkins with g = 1500, Speedy Q-learning, Polyak-Ruppert averaging, Zap (γn = αn), and Zap — Convergence of Zap-Q Learning] Discount factor: β = 0.99 17 / 31
  • 67. Zap Q-Learning Watkins’ algorithm Zap Q-Learning Optimize Walk to Cafe Convergence with Zap gain γn = n−0.85 [Histograms of Wn = √n ˜θn for entries #10 and #18, at n = 10^4 and n = 10^6: theoretical pdf vs. experimental pdf, 1000 trials] CLT gives good prediction of finite-n performance 18 / 31
  • 68. Zap Q-Learning Watkins’ algorithm Zap Q-Learning Optimize Walk to Cafe Convergence with Zap gain γn = n−0.85 [Histograms of Wn = √n ˜θn for entries #10 and #18, at n = 10^4 and n = 10^6: theoretical pdf vs. experimental pdf, 1000 trials] CLT gives good prediction of finite-n performance Discount factor: β = 0.99 19 / 31
  • 69. Zap Q-Learning Watkins’ algorithm Zap Q-Learning Optimize Walk to Cafe Local Convergence: θ(0) initialized in a neighborhood of θ∗ [Histograms of the Bellman error at n = 10^6: Watkins with g = 500, 1500, 5000, Speedy Q-learning, Polyak-Ruppert averaging, and Zap-Q with γn ≡ αn and γn ≡ αn^0.85] 20 / 31
  • 70. Zap Q-Learning Watkins’ algorithm Zap Q-Learning Optimize Walk to Cafe Local Convergence: θ(0) initialized in a neighborhood of θ∗ [Bellman-error histograms at n = 10^6 and 2σ confidence intervals vs. n for the Q-learning algorithms: Watkins with g = 500, 1500, 5000, Speedy Q-learning, Polyak-Ruppert averaging, and Zap-Q with γn ≡ αn and γn ≡ αn^0.85] 20 / 31
  • 71. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R100 Parameterized Q-function: Qθ with θ ∈ R10 [Plot: Real λi(A), i = 1, ..., 10] Re λ > −1/2 for every eigenvalue λ ⇒ asymptotic covariance is infinite 21 / 31
  • 72. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R100 Parameterized Q-function: Qθ with θ ∈ R10 [Plot: Real λi(A), i = 1, ..., 10] Re λ > −1/2 for every eigenvalue λ ⇒ asymptotic covariance is infinite Authors observed slow convergence and proposed a matrix gain sequence {Gn} (see refs for details) 21 / 31
  • 73. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R100 Parameterized Q-function: Qθ with θ ∈ R10 [Plot: eigenvalues of A and GA for the finance example — Real λi(A) and λi(GA) in the complex plane] Favorite choice of gain in [23] barely meets the criterion Re(λ(GA)) < −1/2 21 / 31
  • 74. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R100. Parameterized Q-function: Qθ with θ ∈ R10 [Histograms of Wn = √n ˜θn for entries #1 and #7 at n = 2 × 10^6, Zap-Q vs. G-Q: theoretical pdf vs. experimental pdf, 1000 trials] 22 / 31
  • 75. Zap Q-Learning Optimal stopping Zap Q-Learning Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance State space: R100. Parameterized Q-function: Qθ with θ ∈ R10 [Histograms of the average reward obtained using the different algorithms — G-Q(0) with g = 100, 200 and Zap-Q with ρ = 0.8, 0.85, 1.0 — at n = 2 × 10^4, 2 × 10^5, and 2 × 10^6] 23 / 31
  • 76. Conclusions & Future Work Conclusions & Future Work Conclusions Reinforcement Learning is not just cursed by dimension, but also by variance We need better design tools to improve performance 24 / 31
  • 77. Conclusions & Future Work Conclusions & Future Work Conclusions Reinforcement Learning is not just cursed by dimension, but also by variance We need better design tools to improve performance The asymptotic covariance is an awesome design tool. It is also predictive of finite-n performance. Example: g∗ = 1500 was chosen based on asymptotic covariance 24 / 31
  • 78. Conclusions & Future Work Conclusions & Future Work Conclusions Reinforcement Learning is not just cursed by dimension, but also by variance We need better design tools to improve performance The asymptotic covariance is an awesome design tool. It is also predictive of finite-n performance. Example: g∗ = 1500 was chosen based on asymptotic covariance Future work: Q-learning with function-approximation Obtain conditions for a stable algorithm in a general setting Optimal stopping time problems Adaptive optimization of algorithm parameters 24 / 31
  • 79. Conclusions & Future Work Conclusions & Future Work Conclusions Reinforcement Learning is not just cursed by dimension, but also by variance We need better design tools to improve performance The asymptotic covariance is an awesome design tool. It is also predictive of finite-n performance. Example: g∗ = 1500 was chosen based on asymptotic covariance Future work: Q-learning with function-approximation Obtain conditions for a stable algorithm in a general setting Optimal stopping time problems Adaptive optimization of algorithm parameters Finite-time analysis 24 / 31
  • 80. Conclusions & Future Work Thank you! thankful 25 / 31
  • 81. References [Book covers: Control Techniques for Complex Networks, Sean Meyn — more information at http://www.cambridge.org/us/catalogue/catalogue.asp?isbn=9780521884419 — and Markov Chains and Stochastic Stability, S. P. Meyn and R. L. Tweedie] References 26 / 31
  • 82. References This lecture A. M. Devraj and S. P. Meyn, Zap Q-learning. Advances in Neural Information Processing Systems (NIPS). Dec. 2017. A. M. Devraj and S. P. Meyn, Fastest convergence for Q-learning. Available on ArXiv. Jul. 2017. 27 / 31
  • 83. References Selected References I [1] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv , July 2017. [2] A. Benveniste, M. M´etivier, and P. Priouret. Adaptive algorithms and stochastic approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag, Berlin, 1990. Translated from the French by Stephen S. Wilson. [3] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK, 2008. [4] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000. [5] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge University Press, Cambridge, second edition, 2009. Published in the Cambridge Mathematical Library. [6] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007. See last chapter on simulation and average-cost TD learning 28 / 31
  • 84. References Selected References II [7] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure. The Annals of Statistics, 13(1):236–245, 1985. [8] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro processes. Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research and Industrial Engineering, Ithaca, NY, 1988. [9] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i telemekhanika (in Russian). translated in Automat. Remote Control, 51 (1991), pages 98–107, 1990. [10] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging. SIAM J. Control Optim., 30(4):838–855, 1992. [11] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic approximation. Ann. Appl. Probab., 14(2):796–819, 2004. [12] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation algorithms for machine learning. In Advances in Neural Information Processing Systems 24, pages 451–459. Curran Associates, Inc., 2011. 29 / 31
  • 85. References Selected References III [13] C. Szepesv´ari. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010. [14] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, Cambridge, UK, 1989. [15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992. [16] R. S. Sutton.Learning to predict by the methods of temporal differences. Mach. Learn., 3(1):9–44, 1988. [17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997. [18] C. Szepesv´ari. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th Internat. Conf. on Neural Info. Proc. Systems, pages 1064–1070. MIT Press, 1997. [19] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In Advances in Neural Information Processing Systems, 2011. [20] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning Research, 5(Dec):1–25, 2003. 30 / 31
  • 86. References Selected References IV [21] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and Approximate Dynamic Programming for Feedback Control. Wiley, 2011. [22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space theory, approximation algorithms, and an application to pricing high-dimensional financial derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999. [23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and Applications, 16(2):207–239, 2006. [24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference learning. Mach. Learn., 22(1-3):33–57, 1996. [25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn., 49(2-3):233–246, 2002. [26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003. [27] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In IEEE Conference on Decision and Control, pages 3598–3605, Dec. 2009. 31 / 31