深層強化学習入門　2020年度Deep Learning基礎講座「強化学習」

•
•
•
•
•
•
•
•
•

•
•
•
10360
1060
10120
10220

•
•
•
st+1
st at
p(st+1 |st, at)
st at st+1
rt rt+1 rt+2
at 1 at at+1
st st+1

•
•
•
•
st
ot
at st
rt ot rt+1 ot+1 rt+2
at 1 at at+1
st st+1

•
•
•
•
•
•
•
rt+1 st at
{s0, a0, r1, s1, a1, r2, ⋯}
Rt {rt+1, rt+2, ⋯}
Rt = rt+1 + γrt+2 + γ2
rt+3⋯ =
∞
∑
k=0
γk
rt+k+1
γ (0 ≤ γ ≤ 1)
𝔼 [Rt]
γ

•
•
•
•
π(s)
s a
Vπ
(s)
π s
Qπ
(s, a)
π s a

•
•
Rt
Vπ
(s) = 𝔼 [Rt |st = s] = 𝔼
[
∞
∑
k=0
γk
rt+k+1 |st = s
]
Qπ
(s, a) = 𝔼 [Rt |st = s, at = a] = 𝔼
[
∞
∑
k=0
γk
rt+k+1 |st = s, at = a
]

•
•
Aπ
= Qπ
− Vπ
π s a

•
•
•
•
•
s a
μ
s a
μ

•
•
•
•
•
Qπ
(st, at) ← Qπ
(st, at) + αδt
δt = yt − Qπ
(st, at)
α δt yt

•
•
•
•
•
ySARSA
t = rt+1 + γQπ
(st+1, at+1)
Qπ
(st, at) ← Qπ
(st, at) + α (rt+1 + γQπ
(st+1, at+1) − Qπ
(st, at))
π
{st, at, rt, st+1, at+1}

•
•
•
yQ−learning
t = rt+1 + γ max
a′
Qπ
(st+1, a′)
Qπ
(st, at) ← Qπ
(st, at) + α
(
rt+1 + γ max
a′
Qπ
(st+1, a′) − Qπ
(st, at)
)

•
•
•
•
a = arg maxa′
Q(s, a′)
ϵ
ϵ 1 − ϵ

•
•
Qπ
(s, a) = 𝔼π,p
[
∞
∑
k=0
γk
]
= 𝔼π,p
[
rt+1 + γ
∞
∑
k=0
γk
]
= 𝔼s′∼p r(s, a, s′) + γ𝔼a′∼π 𝔼π,p
[
∞
∑
k=0
γk
r(t+1)+k+1 |st+1 = s′, at+1 = a′
]
|st = s, at = a
= 𝔼s′∼p [r(s, a, s′) + γ𝔼a′∼π [Qπ
(s′, a′))]]
Rt+1
Qπ
(s, a) = 𝔼 [Rt |st = s, at = a]

•
• π
Qπ
(s, a) = 𝔼s′∼p [r(s, a, s′) + γ𝔼a′∼π [Qπ
(s′, a′))]]
= 𝔼s′∼p [
r(s, a, s′) + γ max
a′
[Qπ
(s′, a′))]]

•
•
•
•
• 🤔
s0
s
Point in
state space
to be
represented
Tiling 1
Tiling 2
Tiling 3
Tiling 4
Continuous
2D state
space
Four active
tiles/features
overlap the point
and are used to
represent it

•
•
•
•
•
• Vπ
(s), Qπ
(s . a) π

•
•
•
•
•
•
•
Qπ
(st, at) ← Qπ
(st, at) + α
(
rt+1 + γ max
a′
Qπ
(st+1, a′) − Qπ
(st, at)
)

•
•
•
•
•
•
•
255210×160×3

•
•
•
Qπ
(st, at) ← Qπ
(st, at) + α
(
rt+1 + γ max
a′
Qπ
(st+1, a′) − Qπ
(st, at)
)
max
a′
Qπ
(st+1, a′)
max
a′
Qπ
(st+1, a′)
yQ−learning
t = rt+1 + γ max
a′
Qπ
(st+1, a′)

•
•
•
•
Qπ
(s, a) = 𝔼s′∼p [
r(s, a, s′) + γ max
a′
[Qπ
(s′, a′))]]
p π

•
•
•
•
πθ
𝒥(πθ) = 𝔼πθ [fπθ
( ⋅ )] θ
fπθ
( ⋅ )
Rt
Vπ
, Qπ
∇θ 𝒥(πθ) = 𝔼πθ [∇θlog πθ ⋅ fπθ
( ⋅ )]

•
•
•
•
•
•
•
•
Rt fπθ
( ⋅ ) = Rt
Rt
Rt b(s)
fπθ
( ⋅ ) = Rt − b(s)
b(s)

•
•
•
•
Vπ
, Qπ
π
Rt b(s)
Vπ
, Qπ
Rt Qπ
b(s) Vπ
Aπ
= Qπ
− Vπ

•
•
•
•
•
•
•
•

•
•
•
•
•
a = arg maxa′
Q(s, a′)

•
•
•
a = arg maxa′
Q(s, a′)

•
Y LeCun
How Much Information Does the Machine Need to Predict?
“Pure” Reinforcement Learning (cherry)
The machine predicts a scalar
reward given once in a while.
A few bits for some samples
Supervised Learning (icing)
The machine predicts a category
or a few numbers for each input
Predicting human-supplied data
10 10,000 bits per sample→
Unsupervised/Predictive Learning (cake)
The machine predicts any part of
its input for any observed part.
Predicts future frames in videos
Millions of bits per sample
(Yes, I know, this picture is slightly offensive to RL folks. But I’ll make it up)

•
•
•
•
•
•
•
•
•
•

深層強化学習入門　2020年度Deep Learning基礎講座「強化学習」

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to 深層強化学習入門　2020年度Deep Learning基礎講座「強化学習」

Similar to 深層強化学習入門　2020年度Deep Learning基礎講座「強化学習」 (20)

Recently uploaded

Recently uploaded (20)