115. POMDP + Optimality Variables
[Figure: graphical model unrolled over N time steps, with observations $x_t, x_{t+1}$, latent states $s_t, s_{t+1}$, actions $a_t, a_{t+1}$, rewards $r_t, r_{t+1}$, and optimality variables $o_t, o_{t+1}$]
116. POMDP
In the POMDP setting, the optimal policy is $p(a_t \mid s_t, o_{\ge t})$, and the latent state $s$ is inferred from the observation $x$ via $p(s_t \mid x_t, s_{t-1}, a_{t-1})$. The posterior over latent states and the current action factorizes as

$$p(s_{\le t}, a_t \mid x_{\le t}, a_{<t}, o_{\ge t}) = p(a_t \mid s_t, o_{\ge t})\, p(s_1 \mid x_1) \prod_{\tau=1}^{t} p(s_{\tau+1} \mid x_{\tau+1}, s_\tau, a_\tau)$$
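The factorization above can be sketched numerically: sample $s_1$ from the first factor, then roll forward one factor per step. A minimal toy example, assuming one-dimensional linear-Gaussian factors (all coefficients and function names here are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_posterior_states(xs, acts):
    """Sample s_{<=t} from the factorized posterior
    p(s_1 | x_1) * prod_tau p(s_{tau+1} | x_{tau+1}, s_tau, a_tau).
    Toy choice: each factor is a Gaussian whose mean mixes the new
    observation with the propagated previous state and action."""
    # s_1 ~ p(s_1 | x_1)
    s = [0.5 * xs[0] + 0.1 * rng.standard_normal()]
    for tau in range(len(acts)):
        # s_{tau+1} ~ p(s_{tau+1} | x_{tau+1}, s_tau, a_tau)
        mean = 0.5 * xs[tau + 1] + 0.4 * s[-1] + 0.1 * acts[tau]
        s.append(mean + 0.1 * rng.standard_normal())
    return np.array(s)

xs = np.array([1.0, 0.8, 1.2])   # observations x_1..x_3
acts = np.array([0.0, 0.5])      # actions a_1, a_2
states = sample_posterior_states(xs, acts)
print(states.shape)  # (3,)
```

One latent sample is drawn per time step, mirroring the product structure of the posterior.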
117. Variational Approximation of $p(s_{\le t}, a_t \mid x_{\le t}, a_{<t}, o_{\ge t})$

$$q_\phi(s_{\le t}, a_t \mid x_{\le t}, a_{<t}) = \pi_\phi(a_t \mid s_t)\, q_\phi(s_1 \mid x_1) \prod_{\tau=1}^{t} q_\phi(s_{\tau+1} \mid x_{\tau+1}, s_\tau, a_\tau)$$
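The log-density of this variational distribution is a sum of one policy term, one initial-state term, and one transition term per step. A minimal sketch, assuming toy Gaussian parameterizations (the means and scales below are illustrative stand-ins, not the slides' networks):

```python
import numpy as np

def gauss_logpdf(v, mean, std):
    # log N(v; mean, std^2)
    return -0.5 * np.log(2 * np.pi * std**2) - (v - mean) ** 2 / (2 * std**2)

def log_q(states, action, xs, acts):
    """log q_phi(s_{<=t}, a_t | x_{<=t}, a_{<t}) under the factorization
    pi_phi(a_t|s_t) * q_phi(s_1|x_1) * prod q_phi(s_{tau+1}|x_{tau+1},s_tau,a_tau)."""
    lp = gauss_logpdf(action, 0.3 * states[-1], 1.0)   # log pi_phi(a_t | s_t)
    lp += gauss_logpdf(states[0], 0.5 * xs[0], 0.2)    # log q_phi(s_1 | x_1)
    for tau in range(len(states) - 1):
        mean = 0.5 * xs[tau + 1] + 0.4 * states[tau] + 0.1 * acts[tau]
        lp += gauss_logpdf(states[tau + 1], mean, 0.2)  # transition factors
    return lp

s = np.array([0.4, 0.6, 0.7])
lp = log_q(s, action=0.2, xs=np.array([1.0, 0.8, 1.2]), acts=np.array([0.0, 0.5]))
print(np.isfinite(lp))  # True
```

In practice each factor would be an amortized network; the additive structure of the log-probability is what matters here.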
118. KL divergence
$$\begin{aligned}
&\mathrm{KL}\big(q_\phi(s_{\le t}, a_t \mid x_{\le t}, a_{<t}) \,\big\|\, p(s_{\le t}, a_t \mid x_{\le t}, a_{<t}, o_{\ge t})\big) \\
&= \mathbb{E}_{q_\phi}\!\left[\log \frac{q_\phi(s_{\le t}, a_t \mid x_{\le t}, a_{<t})}{p(s_{\le t}, a_t \mid x_{\le t}, a_{<t}, o_{\ge t})}\right] \\
&= \underbrace{\mathbb{E}_{q_\phi}\!\left[\log \pi_\phi(a_t \mid s_t) + \log \frac{q_\phi(s_1 \mid x_1)}{p(x_1, s_1)} + \sum_{\tau=1}^{t} \log \frac{q_\phi(s_{\tau+1} \mid x_{\tau+1}, s_\tau, a_\tau)}{p(x_{\tau+1}, s_{\tau+1} \mid s_\tau, a_\tau)} - Q^*(s_t, a_t)\right]}_{-\mathcal{L}_\phi(x_{\le t},\, a_{<t},\, o_{\ge t})} \\
&\quad + \log p(x_{\le t} \mid a_{<t}) + V^*(s_t)
\end{aligned}$$
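Since $\log p(x_{\le t} \mid a_{<t})$ and $V^*(s_t)$ do not depend on $\phi$, minimizing the KL amounts to minimizing the bracketed expectation $-\mathcal{L}_\phi$. A single-sample Monte-Carlo estimate of that bracket can be assembled from per-step log-densities; the numbers below are arbitrary toy inputs, not real model outputs:

```python
import numpy as np

def neg_elbo_sample(log_pi, log_q_s, log_p_xs, q_value):
    """Single-sample estimate of the bracketed KL term (-L_phi):
    log pi_phi(a_t|s_t)
    + sum_tau [ log q_phi(s_tau|.) - log p(x_tau, s_tau|.) ]
    - Q*(s_t, a_t).
    Inputs are log-densities evaluated on one sample from q_phi."""
    return log_pi + np.sum(np.asarray(log_q_s) - np.asarray(log_p_xs)) - q_value

# toy numbers for a 3-step latent sequence
val = neg_elbo_sample(log_pi=-1.2,
                      log_q_s=[-0.5, -0.7, -0.6],
                      log_p_xs=[-1.0, -1.1, -0.9],
                      q_value=2.0)
print(round(val, 2))  # -2.0
```

The `log_q_s - log_p_xs` differences are the per-step reconstruction/KL terms, and `q_value` plays the role of the soft Q-function $Q^*(s_t, a_t)$.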
161. Control as Inference
Reinforcement Learning and Control as Probabilistic Inference: Tutorial and Review
https://arxiv.org/abs/1805.00909
UC Berkeley Deep RL course (14)
http://rail.eecs.berkeley.edu/deeprlcourse-fa19/
162. Control as Inference
Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor
https://arxiv.org/abs/1801.01290
Reinforcement Learning with Deep Energy-Based Policies
https://arxiv.org/abs/1702.08165
Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model
https://arxiv.org/abs/1907.00953