Reinforcement learning: hidden theory, and new super-fast algorithms
Lecture presented at the Center for Systems and Control (CSC@USC) and Ming Hsieh Institute for Electrical Engineering,
February 21, 2018
Stochastic approximation algorithms are used to approximate solutions to fixed-point equations that involve expectations of functions with respect to possibly unknown distributions. The most famous examples today are the TD- and Q-learning algorithms. The first half of this lecture provides an overview of stochastic approximation, with a focus on optimizing the rate of convergence. A new approach to optimizing this rate leads to the new Zap Q-learning algorithm. Analysis suggests that its transient behavior closely matches a deterministic Newton-Raphson implementation, and numerical experiments confirm super-fast convergence.
Based on
@article{devmey17a,
Title = {Fastest Convergence for {Q-learning}},
Author = {Devraj, Adithya M. and Meyn, Sean P.},
Journal = {NIPS 2017 and ArXiv e-prints},
Year = 2017}
1. Zap Q-Learning
Reinforcement Learning: Hidden Theory, and New Super-Fast Algorithms
Center for Systems and Control (CSC@USC)
and Ming Hsieh Institute for Electrical Engineering
February 21, 2018
Adithya M. Devraj and Sean P. Meyn
Department of Electrical and Computer Engineering — University of Florida
4. Stochastic Approximation Basic Algorithm
What is Stochastic Approximation?
A simple goal: Find the solution θ∗ to
\bar{f}(\theta^*) := \mathrm{E}[f(\theta, W)]\big|_{\theta=\theta^*} = 0
What makes this hard?
1. The function f and the distribution of the random vector W may not be known
– we may only know something about the structure of the problem
2. Even if everything is known, computation of the expectation may be expensive. For root finding, we may need to compute the expectation for many values of θ
3. This motivates stochastic approximation, the recursion (sketched in code below):
\theta(n+1) = \theta(n) + \alpha_n f(\theta(n), W(n))
The recursive algorithms we come up with are often slow, and their variance may be infinite: typical in Q-learning [Devraj & M 2017]
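To make the recursion concrete, here is a minimal Python sketch. The choices are illustrative, not from the lecture: f(θ, w) = w − θ with W ∼ N(2, 1), so that f̄(θ) = 2 − θ and θ∗ = 2.

import numpy as np

rng = np.random.default_rng(0)

theta = 0.0
for n in range(1, 10_001):
    w = rng.normal(2.0, 1.0)   # one sample W(n); its distribution need not be known
    alpha_n = 1.0 / n          # steps satisfy sum alpha_n = inf, sum alpha_n^2 < inf
    theta += alpha_n * (w - theta)

print(theta)                   # ~ 2.0 = thetastar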
9. Stochastic Approximation ODE Method
Algorithm and Convergence Analysis
Algorithm:
\theta(n+1) = \theta(n) + \alpha_n f(\theta(n), W(n))
Goal:
\bar{f}(\theta^*) := \mathrm{E}[f(\theta, W)]\big|_{\theta=\theta^*} = 0
Interpretation: θ∗ ≡ stationary point of the ODE
\frac{d}{dt}\vartheta(t) = \bar{f}(\vartheta(t))
Analysis: Stability of the ODE (see Borkar’s monograph) =⇒
\lim_{n\to\infty} \theta(n) = \theta^*
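A sketch of the ODE interpretation, reusing the illustrative f̄(θ) = 2 − θ from above: measured in "ODE time" t_n = α_1 + ... + α_n, the SA iterates track an Euler solution of the ODE.

import numpy as np

rng = np.random.default_rng(1)
theta, vartheta = 0.0, 0.0
for n in range(1, 5001):
    alpha_n = 1.0 / n
    theta += alpha_n * (2.0 - theta + rng.standard_normal())  # noisy SA step
    vartheta += alpha_n * (2.0 - vartheta)                    # Euler step of the ODE

print(theta, vartheta)   # both near thetastar = 2; the gap shrinks as n grows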
11. Stochastic Approximation SA Example
Stochastic Approximation Example: Monte-Carlo
Monte-Carlo Estimation
Estimate the mean η = E[c(X)], where X is a random variable with density f_X:
\eta = \int c(x)\, f_X(x)\, dx
SA interpretation: Find θ∗ solving 0 = E[f(θ, X)] = E[c(X) − θ]
Algorithm:
\theta(n) = \frac{1}{n}\sum_{i=1}^{n} c(X(i))
=⇒ (n+1)\theta(n+1) = \sum_{i=1}^{n+1} c(X(i)) = n\theta(n) + c(X(n+1))
=⇒ (n+1)\theta(n+1) = (n+1)\theta(n) + [c(X(n+1)) - \theta(n)]
SA Recursion: \theta(n+1) = \theta(n) + \alpha_n f(\theta(n), X(n+1)), with \alpha_n = \tfrac{1}{n+1}, so \sum_n \alpha_n = \infty, \ \sum_n \alpha_n^2 < \infty
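A quick check of the algebra above: the running average and the SA recursion produce identical iterates. The choices c(x) = x and an exponential distribution are illustrative.

import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(scale=3.0, size=1000)

theta = 0.0
for n, xn in enumerate(x):
    theta += (xn - theta) / (n + 1)   # alpha_n = 1/(n+1)

assert np.isclose(theta, x.mean())    # identical to the Monte-Carlo average
print(theta)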
16. Fastest Stochastic Approximation Algorithm Performance
Asymptotic Covariance
With \tilde\theta(n) := \theta(n) - \theta^*:
\Sigma = \lim_{n\to\infty} \Sigma_n = \lim_{n\to\infty} n\,\mathrm{E}\bigl[\tilde\theta(n)\tilde\theta(n)^T\bigr], \qquad \sqrt{n}\,\tilde\theta(n) \approx N(0, \Sigma)
SA recursion for the covariance:
\Sigma_{n+1} \approx \Sigma_n + \frac{1}{n}\Bigl[(A + \tfrac{1}{2}I)\Sigma_n + \Sigma_n(A + \tfrac{1}{2}I)^T + \Sigma_\Delta\Bigr], \qquad A = \frac{d}{d\theta}\bar{f}(\theta^*)
Conclusions
1. If Re λ(A) ≥ −1/2 for some eigenvalue, then Σ is (typically) infinite
2. If Re λ(A) < −1/2 for every eigenvalue, then Σ = lim_{n→∞} Σ_n is the unique solution to the Lyapunov equation (solved numerically below):
0 = (A + \tfrac{1}{2}I)\Sigma + \Sigma(A + \tfrac{1}{2}I)^T + \Sigma_\Delta
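The Lyapunov equation is easy to check numerically; a SciPy sketch, with made-up A and Σ_Δ standing in for a real problem:

import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[-2.0, 0.3],
              [ 0.0, -1.0]])   # Re lambda(A) = {-2, -1} < -1/2: Sigma is finite
Sigma_Delta = np.eye(2)

# Solve 0 = (A + I/2) Sigma + Sigma (A + I/2)^T + Sigma_Delta:
Sigma = solve_continuous_lyapunov(A + 0.5 * np.eye(2), -Sigma_Delta)
print(Sigma)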
17. Fastest Stochastic Approximation Stochastic Newton Raphson
Optimal Asymptotic Covariance
Introduce a d × d matrix gain sequence {Gn}:
\theta(n+1) = \theta(n) + \frac{1}{n+1}\, G_n f(\theta(n), X(n))
Assume it converges, and linearize:
\tilde\theta(n+1) \approx \tilde\theta(n) + \frac{1}{n+1}\, G\bigl[A\tilde\theta(n) + \Delta(n+1)\bigr], \qquad A = \frac{d}{d\theta}\bar{f}(\theta^*)
If G = G∗ := −A^{-1}, then:
Resembles a Monte-Carlo estimate
Resembles Newton-Raphson
It is optimal: \Sigma^* = G^*\Sigma_\Delta {G^*}^T \le \Sigma_G for any other gain G (see the numerical check below)
Polyak-Ruppert averaging is also optimal, but the first two bullets are missing.
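A numerical check of the optimality claim, continuing the made-up A and Σ_Δ above. With a fixed stabilizing gain G, the linearized recursion has matrix GA and noise covariance GΣ_ΔG^T, so Σ_G solves the corresponding Lyapunov equation, and Σ_G − Σ∗ should be positive semidefinite.

import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[-2.0, 0.3], [0.0, -1.0]])
Sigma_Delta = np.eye(2)

def asymptotic_cov(G):
    # Solve 0 = (GA + I/2) Sigma + Sigma (GA + I/2)^T + G Sigma_Delta G^T
    GA = G @ A
    return solve_continuous_lyapunov(GA + 0.5 * np.eye(2), -(G @ Sigma_Delta @ G.T))

G_star = -np.linalg.inv(A)
Sigma_star = asymptotic_cov(G_star)            # equals G* Sigma_Delta (G*)^T
Sigma_other = asymptotic_cov(2.0 * np.eye(2))  # some other stabilizing gain
print(np.linalg.eigvalsh(Sigma_other - Sigma_star))  # nonnegative: Sigma* is minimal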
21. Fastest Stochastic Approximation Stochastic Newton Raphson
Optimal Variance
Example: return to Monte-Carlo
\theta(n+1) = \theta(n) + \frac{g}{n+1}\bigl[-\theta(n) + X(n+1)\bigr]
Normalization for analysis, with ∆(n) = X(n) − E[X(n)]:
\tilde\theta(n+1) = \tilde\theta(n) + \frac{g}{n+1}\bigl[-\tilde\theta(n) + \Delta(n+1)\bigr]
Example: X(n) = W²(n), W ∼ N(0, 1)
[Figure: asymptotic variance as a function of the gain g,
\Sigma = \frac{\sigma_\Delta^2}{2}\,\frac{g^2}{g - 1/2},
minimized at g = 1]
[Figure: SA estimates of E[W²], W ∼ N(0, 1), as a function of n for gains g ∈ {0.1, 0.5, 1, 10, 20}]
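A simulation sketch of the variance formula for this example (θ∗ = 1 and σ²_Δ = Var(W²) = 2); the gains below are illustrative choices satisfying g > 1/2:

import numpy as np

rng = np.random.default_rng(3)
n_iters, n_trials = 50_000, 500

for g in [1.0, 2.0, 5.0]:
    theta = np.zeros(n_trials)
    for n in range(1, n_iters + 1):
        x = rng.standard_normal(n_trials) ** 2
        theta += (g / (n + 1)) * (x - theta)
    sigma_hat = n_iters * np.mean((theta - 1.0) ** 2)  # estimate of n E[theta_tilde^2]
    sigma_pred = 2.0 * g * g / (2.0 * g - 1.0)         # (sigma^2/2) g^2/(g - 1/2)
    print(f"g={g}: simulated {sigma_hat:.2f}, predicted {sigma_pred:.2f}")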
27. Fastest Stochastic Approximation Stochastic Newton Raphson
Optimal Asymptotic Covariance and Zap-SNR
Zap-SNR (designed to emulate deterministic Newton-Raphson)
Requires \widehat{A}_n \approx A(\theta_n) := \frac{d}{d\theta}\bar{f}(\theta_n)
\theta(n+1) = \theta(n) + \alpha_n\bigl(-\widehat{A}_n\bigr)^{-1} f(\theta(n), X(n))
\widehat{A}_n = \widehat{A}_{n-1} + \gamma_n\bigl(A_n - \widehat{A}_{n-1}\bigr), \qquad A_n = \frac{d}{d\theta} f(\theta(n), X(n))
The estimate \widehat{A}_n \approx A(\theta_n) requires a high gain: \gamma_n/\alpha_n \to \infty as n \to \infty
Always: αn = 1/n. In the numerics that follow: γn = (1/n)^ρ, ρ ∈ (0.5, 1)
ODE for Zap-SNR:
\frac{d}{dt} x_t = \bigl(-A(x_t)\bigr)^{-1}\bar{f}(x_t), \qquad A(x) = \frac{d}{dx}\bar{f}(x)
Not necessarily stable (just like deterministic Newton-Raphson)
General conditions for convergence remain open; a two-time-scale sketch follows below
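A two-time-scale sketch of Zap-SNR on a linear toy problem. The matrices A, b and the noise levels are made up, not the lecture's example: f(θ, X) = A_X θ + b_X with E[A_X] = A and E[b_X] = b, so f̄(θ) = Aθ + b and θ∗ = −A^{-1}b.

import numpy as np

rng = np.random.default_rng(4)
A = np.array([[-2.0, 0.3], [0.0, -1.0]])
b = np.array([1.0, -1.0])
d = 2

theta = np.zeros(d)
A_hat = -np.eye(d)          # crude initial estimate of d/dtheta bar-f
rho = 0.85
for n in range(1, 100_001):
    A_n = A + 0.1 * rng.standard_normal((d, d))      # noisy sample of the derivative
    f_n = A_n @ theta + b + 0.1 * rng.standard_normal(d)
    gamma_n = n ** (-rho)   # fast time scale: gamma_n / alpha_n -> infinity
    alpha_n = 1.0 / n       # slow time scale
    A_hat += gamma_n * (A_n - A_hat)                 # track A(theta_n)
    theta += alpha_n * np.linalg.solve(-A_hat, f_n)  # Newton-Raphson-like step

print(theta, -np.linalg.solve(A, b))                 # compare with thetastar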
35. Reinforcement Learning RL & SA
SA and RL Design
Functional equations in Stochastic Control
Always of the form
0 = \mathrm{E}\bigl[F(h^*, \Phi(n+1)) \mid \Phi(0), \ldots, \Phi(n)\bigr], \qquad h^* = ?
Φ(n) = (state, action)
Galerkin relaxation:
0 = \mathrm{E}\bigl[F(h^{\theta^*}, \Phi(n+1))\,\zeta_n\bigr], \qquad \theta^* = ?
Necessary Ingredients:
Parameterized family {h^θ : θ ∈ R^d}
Adapted, d-dimensional stochastic process {ζn}
Examples are TD- and Q-learning; a TD(0) sketch follows below
These algorithms are thus special cases of stochastic approximation (as we all know)
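For instance, TD(0) for policy evaluation fits the Galerkin form with F(h, Φ(n+1)) = c(X(n)) + βh(X(n+1)) − h(X(n)) and ζn = ψ(X(n)). A sketch on a made-up 3-state chain with the complete basis:

import numpy as np

rng = np.random.default_rng(5)
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.6, 0.3],
              [0.2, 0.0, 0.8]])
c = np.array([1.0, 0.0, 2.0])
beta = 0.9

theta = np.zeros(3)          # h_theta(x) = theta[x]
x = 0
for n in range(1, 200_001):
    x_next = rng.choice(3, p=P[x])
    td = c[x] + beta * theta[x_next] - theta[x]  # sample of F(h_theta, Phi(n+1))
    theta[x] += (1.0 / n) * td                   # zeta_n picks out component x
    x = x_next

print(theta, np.linalg.solve(np.eye(3) - beta * P, c))  # compare with exact h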
40. Reinforcement Learning MDP Theory
Stochastic Optimal Control
MDP Model
X is a stationary controlled Markov chain, with input U
For all states x and sets A,
\mathrm{P}\{X(n+1) \in A \mid X(n) = x,\ U(n) = u,\ \text{and prior history}\} = P_u(x, A)
c: X × U → R is a cost function
β < 1 is a discount factor
Value function:
h^*(x) = \min_{U}\ \sum_{n=0}^{\infty} \beta^n\, \mathrm{E}\bigl[c(X(n), U(n)) \mid X(0) = x\bigr]
Bellman equation:
h^*(x) = \min_u\ \bigl\{c(x, u) + \beta\,\mathrm{E}[h^*(X(n+1)) \mid X(n) = x,\ U(n) = u]\bigr\}
A value-iteration sketch follows below.
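A value-iteration sketch that solves the Bellman equation on a made-up two-state, two-action MDP (all numbers illustrative; the same MDP is reused in the Q-learning sketches later):

import numpy as np

# P[u, x, y] = P_u(x, {y}); c[x, u] = cost of using u in state x
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
c = np.array([[1.0, 2.0],
              [0.5, 0.0]])
beta = 0.95

h = np.zeros(2)
for _ in range(2000):
    # Q(x,u) = c(x,u) + beta E[h(X(n+1)) | X(n)=x, U(n)=u]
    Q = c + beta * np.einsum('uxy,y->xu', P, h)
    h = Q.min(axis=1)          # Bellman operator

print(h, Q.argmin(axis=1))     # value function h* and greedy policy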
51. Reinforcement Learning Q-Learning
Watkins’ Q-learning
Big Question: Can we Zap Q-Learning?
Find θ∗ that solves
\mathrm{E}\Bigl[\bigl(c(X(n), U(n)) + \beta \underline{Q}^{\theta^*}(X(n+1)) - Q^{\theta^*}(X(n), U(n))\bigr)\,\zeta_n\Bigr] = 0, \qquad \underline{Q}(x) := \min_u Q(x, u)
Watkins’ algorithm is Stochastic Approximation
The family {Qθ} and eligibility vectors {ζn} in this design:
Linearly parameterized family of functions: Q^\theta(x, u) = \theta^T \psi(x, u)
ζn ≡ ψ(X(n), U(n)), with ψ_n(x, u) = 1{x = x_n, u = u_n} (complete basis)
Asymptotic covariance is typically infinite
A tabular sketch follows below.
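A tabular sketch of Watkins' algorithm on the made-up MDP from the value-iteration sketch; with the complete basis, θ is just the table Q(x, u). The per-entry step size is one common choice, not the lecture's.

import numpy as np

rng = np.random.default_rng(6)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
c = np.array([[1.0, 2.0], [0.5, 0.0]])
beta, n_x, n_u = 0.95, 2, 2

Q = np.zeros((n_x, n_u))
visits = np.zeros((n_x, n_u))
x = 0
for _ in range(500_000):
    u = rng.integers(n_u)                  # randomized exploration policy
    x_next = rng.choice(n_x, p=P[u, x])
    visits[x, u] += 1
    alpha = 1.0 / visits[x, u]             # per-entry step size
    # SA step for 0 = E[(c + beta min_v Q(X', v) - Q(X, U)) zeta_n]
    Q[x, u] += alpha * (c[x, u] + beta * Q[x_next].min() - Q[x, u])
    x = x_next

print(Q.min(axis=1))                       # compare with h* from value iteration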
66. Zap Q-Learning Watkins’ algorithm
Zap Q-Learning
Example: Stochastic Shortest Path
[Figure: six-node graph for the shortest-path example]
Convergence with Zap gain γn = n^{-0.85}
Watkins’ algorithm has infinite asymptotic covariance with αn = 1/n
Optimal scalar gain is approximately αn = 1500/n
[Figure: Bellman error vs. n for Watkins (g = 1500), Speedy Q-learning, Polyak-Ruppert averaging, and Zap (γn = αn and γn = n^{-0.85})]
Convergence of Zap-Q Learning. Discount factor: β = 0.99
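Before the numerical results, a hedged sketch of the Zap-Q update as read from [1]: the matrix-gain estimate is driven by samples A_{n+1} = ζn[βψ(X(n+1), φ(X(n+1))) − ψ(X(n), U(n))]^T, with φ the current greedy policy. This is a minimal reading on the same illustrative two-state MDP; the paper's algorithm takes more care (here a pseudo-inverse guards against a singular estimate during the transient).

import numpy as np

rng = np.random.default_rng(7)
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
c = np.array([[1.0, 2.0], [0.5, 0.0]])
beta, n_x, n_u = 0.95, 2, 2
d = n_x * n_u

def psi(x, u):                 # complete basis: indicator vector
    e = np.zeros(d)
    e[x * n_u + u] = 1.0
    return e

theta = np.zeros(d)
A_hat = -np.eye(d)
x = 0
for n in range(1, 200_001):
    u = rng.integers(n_u)
    x_next = rng.choice(n_x, p=P[u, x])
    q_next = theta[x_next * n_u : (x_next + 1) * n_u]
    u_greedy = int(q_next.argmin())            # greedy = minimize cost
    zeta = psi(x, u)
    td = c[x, u] + beta * q_next[u_greedy] - theta[x * n_u + u]
    A_n = np.outer(zeta, beta * psi(x_next, u_greedy) - zeta)
    A_hat += n ** (-0.85) * (A_n - A_hat)      # gamma_n = n^{-0.85}
    theta += (1.0 / n) * (np.linalg.pinv(-A_hat) @ (zeta * td))  # alpha_n = 1/n
    x = x_next

print(theta.reshape(n_x, n_u).min(axis=1))     # compare with h* from value iteration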
67. Zap Q-Learning Watkins’ algorithm
Zap Q-Learning
Optimize Walk to Cafe
Convergence with Zap gain γn = n^{-0.85}
[Figure: theoretical vs. empirical pdfs of W_n = \sqrt{n}\,\tilde\theta(n), entries #10 and #18, at n = 10^4 and n = 10^6; empirical pdfs from 1000 trials]
CLT gives a good prediction of finite-n performance
Discount factor: β = 0.99
69. Zap Q-Learning Watkins’ algorithm
Zap Q-Learning
Optimize Walk to Cafe
Local Convergence: θ(0) initialized in a neighborhood of θ∗
[Figure: Bellman error vs. n, with histograms at n = 10^6, for Watkins with scalar gains g = 500, 1500, 5000, Speedy Q-learning, Polyak-Ruppert averaging, and Zap-Q with γn ≡ αn and γn = αn^{0.85}]
2σ confidence intervals for the Q-learning algorithms
71. Zap Q-Learning Optimal stopping
Zap Q-Learning
Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R^100
Parameterized Q-function: Qθ with θ ∈ R^10
[Figure: the ten eigenvalues λi(A) are real, spanning roughly −10^0 to −10^−6]
Every eigenvalue is real, and some satisfy λ > −1/2: the asymptotic covariance is infinite
The authors observed slow convergence, and proposed a matrix gain sequence {Gn} (see refs for details)
[Figure: eigenvalues of A and of GA for the finance example]
Favorite choice of gain in [23] barely meets the criterion Re(λ(GA)) < −1/2
74. Zap Q-Learning Optimal stopping
Zap Q-Learning
Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10
[Figure: theoretical vs. empirical pdfs of W_n = \sqrt{n}\,\tilde\theta(n) for Zap-Q and G-Q, entries #1 and #7, at n = 2 × 10^6; empirical pdfs from 1000 trials]
75. Zap Q-Learning Optimal stopping
Zap Q-Learning
Model of Tsitsiklis and Van Roy: Optimal Stopping Time in Finance
State space: R^100. Parameterized Q-function: Qθ with θ ∈ R^10
Histograms of the average reward obtained using the different algorithms:
[Figure: histograms of average reward at n = 2 × 10^4, 2 × 10^5, and 2 × 10^6, for G-Q(0) with g = 100 and g = 200, and Zap-Q with ρ = 0.8, 0.85, and 1.0]
76. Conclusions & Future Work
Conclusions & Future Work
Conclusions
Reinforcement Learning is not just cursed by dimension, but also by variance
We need better design tools to improve performance
The asymptotic covariance is an awesome design tool. It is also predictive of finite-n performance.
Example: g∗ = 1500 was chosen based on the asymptotic covariance
Future work:
Q-learning with function approximation
Obtain conditions for a stable algorithm in a general setting
Optimal stopping time problems
Adaptive optimization of algorithm parameters
Finite-time analysis
81. References
References
[Figure: covers of Control Techniques for Complex Networks (S. Meyn, Cambridge University Press) and Markov Chains and Stochastic Stability (S. P. Meyn and R. L. Tweedie, Cambridge University Press)]
82. References
This lecture
A. M. Devraj and S. P. Meyn, Zap Q-learning. Advances in Neural
Information Processing Systems (NIPS). Dec. 2017.
A. M. Devraj and S. P. Meyn, Fastest convergence for Q-learning. Available
on ArXiv. Jul. 2017.
83. References
Selected References I
[1] A. M. Devraj and S. P. Meyn. Fastest convergence for Q-learning. ArXiv, July 2017.
[2] A. Benveniste, M. Métivier, and P. Priouret. Adaptive algorithms and stochastic
approximations, volume 22 of Applications of Mathematics (New York). Springer-Verlag,
Berlin, 1990. Translated from the French by Stephen S. Wilson.
[3] V. S. Borkar. Stochastic Approximation: A Dynamical Systems Viewpoint. Hindustan
Book Agency and Cambridge University Press (jointly), Delhi, India and Cambridge, UK,
2008.
[4] V. S. Borkar and S. P. Meyn. The ODE method for convergence of stochastic
approximation and reinforcement learning. SIAM J. Control Optim., 38(2):447–469, 2000.
[5] S. P. Meyn and R. L. Tweedie. Markov chains and stochastic stability. Cambridge
University Press, Cambridge, second edition, 2009. Published in the Cambridge
Mathematical Library.
[6] S. P. Meyn. Control Techniques for Complex Networks. Cambridge University Press, 2007.
See last chapter on simulation and average-cost TD learning
84. References
Selected References II
[7] D. Ruppert. A Newton-Raphson version of the multivariate Robbins-Monro procedure.
The Annals of Statistics, 13(1):236–245, 1985.
[8] D. Ruppert. Efficient estimators from a slowly convergent Robbins-Monro process.
Technical Report Tech. Rept. No. 781, Cornell University, School of Operations Research
and Industrial Engineering, Ithaca, NY, 1988.
[9] B. T. Polyak. A new method of stochastic approximation type. Avtomatika i
Telemekhanika (in Russian); translated in Automat. Remote Control, 51 (1991), pages
98–107, 1990.
[10] B. T. Polyak and A. B. Juditsky. Acceleration of stochastic approximation by averaging.
SIAM J. Control Optim., 30(4):838–855, 1992.
[11] V. R. Konda and J. N. Tsitsiklis. Convergence rate of linear two-time-scale stochastic
approximation. Ann. Appl. Probab., 14(2):796–819, 2004.
[12] E. Moulines and F. R. Bach. Non-asymptotic analysis of stochastic approximation
algorithms for machine learning. In Advances in Neural Information Processing Systems
24, pages 451–459. Curran Associates, Inc., 2011.
85. References
Selected References III
[13] C. Szepesvári. Algorithms for Reinforcement Learning. Synthesis Lectures on Artificial
Intelligence and Machine Learning. Morgan & Claypool Publishers, 2010.
[14] C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College,
Cambridge, Cambridge, UK, 1989.
[15] C. J. C. H. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
[16] R. S. Sutton. Learning to predict by the methods of temporal differences. Mach. Learn.,
3(1):9–44, 1988.
[17] J. N. Tsitsiklis and B. Van Roy. An analysis of temporal-difference learning with function
approximation. IEEE Trans. Automat. Control, 42(5):674–690, 1997.
[18] C. Szepesvári. The asymptotic convergence-rate of Q-learning. In Proceedings of the 10th
Internat. Conf. on Neural Info. Proc. Systems, pages 1064–1070. MIT Press, 1997.
[19] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. In
Advances in Neural Information Processing Systems, 2011.
[20] E. Even-Dar and Y. Mansour. Learning rates for Q-learning. Journal of Machine Learning
Research, 5(Dec):1–25, 2003.
86. References
Selected References IV
[21] D. Huang, W. Chen, P. Mehta, S. Meyn, and A. Surana. Feature selection for
neuro-dynamic programming. In F. Lewis, editor, Reinforcement Learning and
Approximate Dynamic Programming for Feedback Control. Wiley, 2011.
[22] J. N. Tsitsiklis and B. Van Roy. Optimal stopping of Markov processes: Hilbert space
theory, approximation algorithms, and an application to pricing high-dimensional financial
derivatives. IEEE Trans. Automat. Control, 44(10):1840–1851, 1999.
[23] D. Choi and B. Van Roy. A generalized Kalman filter for fixed point approximation and
efficient temporal-difference learning. Discrete Event Dynamic Systems: Theory and
Applications, 16(2):207–239, 2006.
[24] S. J. Bradtke and A. G. Barto. Linear least-squares algorithms for temporal difference
learning. Mach. Learn., 22(1-3):33–57, 1996.
[25] J. A. Boyan. Technical update: Least-squares temporal difference learning. Mach. Learn.,
49(2-3):233–246, 2002.
[26] A. Nedic and D. Bertsekas. Least squares policy evaluation algorithms with linear function
approximation. Discrete Event Dyn. Systems: Theory and Appl., 13(1-2):79–110, 2003.
[27] P. G. Mehta and S. P. Meyn. Q-learning and Pontryagin’s minimum principle. In IEEE
Conference on Decision and Control, pages 3598–3605, Dec. 2009.