Reinforcement Learning Study Group #6 Slides (強化学習勉強会6の資料)
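- 3. $M = \{S, A, p_T, p_0, g\}$
$\Pr\{S_{t+1} = s' \mid A_t = a, S_t = s, \ldots\} = \Pr\{S_{t+1} = s' \mid A_t = a, S_t = s\} =: p_T(s' \mid s, a), \quad \Pr(S_0 = s) =: p_0(s)$
$\pi \in \Pi_M$
$\Pr(A_t = a \mid S_t = s, \ldots) = \Pr(A_t = a \mid S_t = s) =: \pi(a \mid s)$
$V^*$
$V^\pi(s) := \mathbb{E}^\pi[C_0 \mid S_0 = s], \quad C_t := \sum_{i=0}^{\infty} \gamma^i g(A_{t+i}, S_{t+i}), \quad \gamma \in [0, 1)$
$f(\pi)$
$f(\pi) := \sum_{s \in S} p_0(s) V^\pi(s)$
$\pi \in \Pi_M$
$f(\pi)$, $M$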
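- 4. $V^* = \max_{\pi \in \Pi_M} V^\pi = \max_{a \in A} \Big( g(a, \cdot) + \gamma \sum_{s' \in S} p_T(s' \mid \cdot, a) V^*(s') \Big) = B^*(V^*)$
$\Rightarrow V^*$
$\pi^*$
$\pi^*_d = \arg\max_{a \in A} \Big[ g(a, \cdot) + \gamma \sum_{s' \in S} p_T(s' \mid \cdot, a) V^*(s') \Big]$
$B^*$
$\Leftrightarrow \|B^*(v) - B^*(u)\| \le \gamma \|v - u\|$
$v_{k+1} = B^*(v_k), \quad v_0 \in \mathbb{R}^n \;\Rightarrow\; v_k \to V^* \quad (k \to \infty)$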
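The iteration $v_{k+1} = B^*(v_k)$ on this slide can be sketched in a few lines of Python. This is only an illustrative sketch: the two-state, two-action MDP (`P`, `G`), the discount `GAMMA = 0.9`, the stopping tolerance, and all function names are assumptions made for the example and do not come from the slides.

```python
# Hypothetical 2-state, 2-action MDP used only for illustration (not from the slides).
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
# P[s][a][s2] plays the role of p_T(s2 | s, a); G[s][a] plays the role of g(a, s).
P = {0: {0: [1.0, 0.0], 1: [0.2, 0.8]},
     1: {0: [0.7, 0.3], 1: [0.0, 1.0]}}
G = {0: {0: 0.0, 1: 1.0},
     1: {0: 0.5, 1: 2.0}}


def backup(v, s, a):
    """One-step lookahead: g(a, s) + gamma * sum_{s'} p_T(s' | s, a) v(s')."""
    return G[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in STATES)


def bellman_optimality(v):
    """Apply B* once to the value vector v."""
    return [max(backup(v, s, a) for a in ACTIONS) for s in STATES]


def value_iteration(v0, tol=1e-10, max_iter=10_000):
    """Iterate v_{k+1} = B*(v_k) until successive iterates differ by less than tol."""
    v = list(v0)
    for _ in range(max_iter):
        v_next = bellman_optimality(v)
        if max(abs(x - y) for x, y in zip(v_next, v)) < tol:
            return v_next
        v = v_next
    return v


def greedy_policy(v):
    """Deterministic policy that picks the maximizing action in each state."""
    return [max(ACTIONS, key=lambda a: backup(v, s, a)) for s in STATES]


v_star = value_iteration([0.0, 0.0])
pi_star = greedy_policy(v_star)
print(v_star, pi_star)
```

Because $B^*$ is a $\gamma$-contraction in the max norm, the starting vector `v0` should not matter for the limit; only the number of iterations changes.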
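- 5. $B^\pi V^\pi(s) := \sum_{a \in A} \pi(a \mid s) \Big[ g(a, s) + \gamma \sum_{s' \in S} p_T(s' \mid s, a) V^\pi(s') \Big] = \mathbb{E}^\pi[g(S_t, A_t) + \gamma V^\pi(S_{t+1}) \mid S_t = s]$
$B^* V^*(s) := \max_{a \in A} \Big( g(a, s) + \gamma \sum_{s' \in S} p_T(s' \mid s, a) V^*(s') \Big) = \max_{\pi \in \Pi_M} \mathbb{E}^\pi[g(S_t, A_t) + \gamma V^*(S_{t+1}) \mid S_t = s]$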
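As a small illustration of the operator $B^\pi$, the sketch below evaluates a fixed policy by repeatedly applying $B^\pi$ until the value vector stops changing. The MDP tables, the uniform random policy `PI`, the discount, and the tolerance are all hypothetical choices for the example, not taken from the slides.

```python
# Hypothetical MDP and policy, for illustration only (not from the slides).
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
P = {0: {0: [1.0, 0.0], 1: [0.2, 0.8]},   # P[s][a][s2] ~ p_T(s2 | s, a)
     1: {0: [0.7, 0.3], 1: [0.0, 1.0]}}
G = {0: {0: 0.0, 1: 1.0},                 # G[s][a] ~ g(a, s)
     1: {0: 0.5, 1: 2.0}}
PI = {s: {a: 0.5 for a in ACTIONS} for s in STATES}  # pi(a | s): uniform random


def bellman_policy(v):
    """Apply B^pi once: average the one-step lookahead over pi(a | s)."""
    return [
        sum(PI[s][a] * (G[s][a] + GAMMA * sum(P[s][a][s2] * v[s2] for s2 in STATES))
            for a in ACTIONS)
        for s in STATES
    ]


def evaluate_policy(tol=1e-10, max_iter=100_000):
    """Iterate v <- B^pi(v) from zero until the update is smaller than tol."""
    v = [0.0 for _ in STATES]
    for _ in range(max_iter):
        v_next = bellman_policy(v)
        if max(abs(x - y) for x, y in zip(v_next, v)) < tol:
            return v_next
        v = v_next
    return v


v_pi = evaluate_policy()
print(v_pi)
```

The returned vector is, up to the tolerance, a fixed point of $B^\pi$, which is how $V^\pi$ is characterized on this slide.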
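- 7. $C_t$
$C_t := \sum_{i=0}^{\infty} \gamma^i g(A_{t+i}, S_{t+i}), \quad \gamma \in [0, 1)$
$V^\pi$
$V^\pi(s) := \mathbb{E}^\pi[C_0 \mid S_0 = s]$
$V^*$
$V^*(s) := \max_{\pi \in \Pi_M} \mathbb{E}^\pi[C_0 \mid S_0 = s]$
$Q^\pi$
$Q^\pi(s, a) := \mathbb{E}^\pi[C_0 \mid S_0 = s, A_0 = a]$
$Q^*$
$Q^*(s, a) := \max_{\pi \in \Pi_M} \mathbb{E}^\pi[C_0 \mid S_0 = s, A_0 = a]$
$V^\pi(s) = \sum_{a \in A} Q^\pi(s, a) \pi(a \mid s), \quad V^*(s) = \max_{a \in A} Q^*(s, a)$
$\pi^*_d = \arg\max_{a \in A} Q^*(\cdot, a)$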
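$\Upsilon^\pi Q^\pi(s, a) := \mathbb{E}^\pi[g(S_t, A_t) + \gamma Q^\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$
$= g(s, a) + \gamma \sum_{(s', a') \in S \times A} p_T(s' \mid s, a) \pi(a' \mid s') Q^\pi(s', a')$
$\Upsilon^* Q^*(s, a) := \mathbb{E}^\pi[g(S_t, A_t) + \gamma \max_{a' \in A} Q^*(S_{t+1}, a') \mid S_t = s, A_t = a]$
$= g(s, a) + \gamma \sum_{s' \in S} p_T(s' \mid s, a) \max_{a' \in A} Q^*(s', a')$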
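- 8. $\Upsilon^\pi(q) = g(\cdot) + \gamma \sum_{(s', a') \in S \times A} p_T(s' \mid \cdot) \pi(a' \mid s') q(s', a')$
$\Upsilon^*(q) = g(\cdot) + \gamma \sum_{s' \in S} p_T(s' \mid \cdot) \max_{a' \in A} q(s', a')$
$q, q' : S \times A \to \mathbb{R}$
$q \le q' \;\Leftrightarrow\; q(s, a) \le q'(s, a), \quad \forall (s, a) \in S \times A$
$\|q - q'\| := \max_{(s, a) \in S \times A} |q(s, a) - q'(s, a)|$
$q \le q' \;\Rightarrow\; \Upsilon(q) \le \Upsilon(q')$
$\Upsilon(q + c) = \Upsilon(q) + \gamma c, \quad \forall c \in \mathbb{R}$
$\Leftrightarrow \|\Upsilon(q) - \Upsilon(q')\| \le \gamma \|q - q'\|$
$q_{k+1} = \Upsilon^*(q_k), \quad q_0 \in \mathbb{R}^{n \times m} \;\Rightarrow\; q_k \to Q^* \quad (k \to \infty)$
$\pi^*_d = \arg\max_{a \in A} Q^*(\cdot, a)$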
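One standard way to see how the monotonicity and constant-shift properties on this slide lead to the contraction bound is the short argument below. It is a sketch rather than the slide author's own proof; $\Upsilon$ stands for either $\Upsilon^\pi$ or $\Upsilon^*$, and $q + c$ means adding the constant $c$ to every entry of $q$.

```latex
\begin{align*}
c := \|q - q'\| \;&\Longrightarrow\; q \le q' + c \quad\text{and}\quad q' \le q + c \\
\Upsilon(q) &\le \Upsilon(q' + c) = \Upsilon(q') + \gamma c \\
\Upsilon(q') &\le \Upsilon(q + c) = \Upsilon(q) + \gamma c \\
\|\Upsilon(q) - \Upsilon(q')\| &\le \gamma c = \gamma \|q - q'\|
\end{align*}
```

The second and third lines each use monotonicity first and the shift property second; combining them bounds every entry of $\Upsilon(q) - \Upsilon(q')$ by $\gamma c$ in absolute value.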
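- 10. $H^\pi_t := \{S_0, A_0, R_0, \ldots, S_{t-1}, A_{t-1}, R_{t-1}, S_t \mid M(\pi)\}$
$h^\pi_t := \{s_0, a_0, r_0, \ldots, s_{t-1}, a_{t-1}, r_{t-1}, s_t \mid M(\pi)\}$
$$\hat\Upsilon^\pi(q; h^\pi_T)(s, a) := \begin{cases} \dfrac{\sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} \big( r_t + \gamma q(s_{t+1}, a_{t+1}) \big)}{\sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}}}, & \text{if } \sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} > 0, \\ q(s, a), & \text{otherwise} \end{cases}$$
$$\hat\Upsilon^*(q; h^\pi_T)(s, a) := \begin{cases} \dfrac{\sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} \big( r_t + \gamma \max_{a' \in A} q(s_{t+1}, a') \big)}{\sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}}}, & \text{if } \sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} > 0, \\ q(s, a), & \text{otherwise} \end{cases}$$
$\Upsilon^\pi Q^\pi(s, a) := \mathbb{E}^\pi[g(S_t, A_t) + \gamma Q^\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$
$\Upsilon^* Q^*(s, a) := \mathbb{E}^\pi[g(S_t, A_t) + \gamma \max_{a' \in A} Q^*(S_{t+1}, a') \mid S_t = s, A_t = a]$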
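The empirical operator $\hat\Upsilon^*$ is a per-pair sample average of one-step targets, which the following sketch tries to make concrete. The data layout (a list of `(s, a, r, s_next)` tuples and a dictionary for `q`) and the function name are assumptions made for this example only.

```python
from collections import defaultdict


def empirical_optimality_operator(q, transitions, actions, gamma):
    """Sample-average version of the optimality operator.

    q           : dict mapping (s, a) -> float
    transitions : list of (s_t, a_t, r_t, s_next) tuples taken from a history
    Returns a new dict; pairs (s, a) never visited keep their old value q(s, a).
    """
    target_sums = defaultdict(float)
    counts = defaultdict(int)
    for s, a, r, s_next in transitions:
        # One-step target r_t + gamma * max_{a'} q(s_{t+1}, a')
        target = r + gamma * max(q[(s_next, a2)] for a2 in actions)
        target_sums[(s, a)] += target
        counts[(s, a)] += 1
    return {
        sa: (target_sums[sa] / counts[sa] if counts[sa] > 0 else q[sa])
        for sa in q
    }


# Tiny hand-made history: the pair (0, 0) is visited twice, nothing else is visited.
actions = [0, 1]
q0 = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 10.0, (1, 1): 5.0}
history = [(0, 0, 1.0, 1), (0, 0, 3.0, 1)]
q1 = empirical_optimality_operator(q0, history, actions, gamma=0.9)
print(q1)
```

In this toy history the two targets for `(0, 0)` are `1 + 0.9 * 10` and `3 + 0.9 * 10`, so their average replaces `q0[(0, 0)]`, while the three unvisited pairs are left untouched.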
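- 11. $\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \Pr(S_t = s, A_t = a \mid M(\pi)) > 0, \quad \forall (s, a) \in S \times A$
$\hat\Upsilon^\pi(\cdot\,; h_T) \to \Upsilon^\pi, \quad \hat\Upsilon^*(\cdot\,; h_T) \to \Upsilon^* \quad (T \to \infty)$
$q \le q' \;\Rightarrow\; \hat\Upsilon(q) \le \hat\Upsilon(q')$
$\hat\Upsilon(q + c) = \hat\Upsilon(q) + \gamma c, \quad \forall c \in \mathbb{R}$
$\Leftrightarrow \|\hat\Upsilon(q) - \hat\Upsilon(q')\| \le \gamma \|q - q'\|$
$q_{k+1} = \hat\Upsilon^*(q_k), \quad q_0 \in \mathbb{R}^{n \times m} \;\Rightarrow\; q_k \to \hat{Q}^* \quad (k \to \infty)$
$\hat\pi^*_d = \arg\max_{a \in A} \hat{Q}^*(\cdot, a)$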
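$$\hat\Upsilon^\pi(q; h^\pi_T)(s, a) := \begin{cases} \dfrac{\sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} \big( r_t + \gamma q(s_{t+1}, a_{t+1}) \big)}{\sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}}}, & \text{if } \sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} > 0, \\ q(s, a), & \text{otherwise} \end{cases}$$
$$\hat\Upsilon^*(q; h^\pi_T)(s, a) := \begin{cases} \dfrac{\sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} \big( r_t + \gamma \max_{a' \in A} q(s_{t+1}, a') \big)}{\sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}}}, & \text{if } \sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} > 0, \\ q(s, a), & \text{otherwise} \end{cases}$$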
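- 13. $q_{k+1} = \hat\Upsilon^*(q_k; h^\pi_\infty), \quad q_0 \in \mathbb{R}^{n \times m} \;\Rightarrow\; q_k \to Q^* \quad (k \to \infty)$
$q_{t+1} = (1 - \alpha_t) q_t + \alpha_t \hat\Upsilon^*(q_t; \{S_t, A_t, R_t, S_{t+1}\}), \quad \mathbb{E}[\|q_0\|] \le \text{const}$
$\alpha_t \ge 0, \quad \forall t \in \mathbb{Z}_{\ge 0}$
$\sum_{t \in \mathbb{Z}_{\ge 0}} \alpha_t \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} = \infty, \quad \forall (s, a) \in S \times A$
$\sum_{t \in \mathbb{Z}_{\ge 0}} \alpha_t^2 \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} < \infty, \quad \forall (s, a) \in S \times A$
$\lim_{t \to \infty} \mathbb{E}[\|q_t - Q^*\|^2] = 0$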
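- 14. $q_{k+1} = \hat\Upsilon^*(q_k; h^\pi_\infty), \quad q_0 \in \mathbb{R}^{n \times m} \;\Rightarrow\; q_k \to Q^* \quad (k \to \infty)$
$q_{t+1} = (1 - \alpha_t) q_t + \alpha_t \hat\Upsilon^*(q_t; \{S_t, A_t, R_t, S_{t+1}\}), \quad \mathbb{E}[\|q_0\|] \le \text{const}$
$a_t \sim \pi(\cdot \mid s_t)$
$r_t, s_{t+1} \sim g(s_t, a_t),\ p_T(\cdot \mid s_t, a_t)$
$\hat{q}_{t+1}(s_t, a_t) = \hat{q}_t(s_t, a_t) + \alpha_t \big( r_t + \gamma \max_{a' \in A} \hat{q}_t(s_{t+1}, a') - \hat{q}_t(s_t, a_t) \big)$
$\pi^*_d = \arg\max_{a \in A} \hat{q}_\infty(\cdot, a)$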
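The sample-based update on this slide is the familiar tabular Q-learning rule, and a minimal sketch of it is given below. Everything concrete here is an assumption for illustration: the small MDP, the discount `GAMMA = 0.5` (chosen small so the demo settles quickly), the uniform random behaviour policy, and the step size $\alpha_t = 1/N_t(s_t, a_t)$, where $N_t$ counts visits to the pair. That step size is one choice for which $\sum \alpha = \infty$ and $\sum \alpha^2 < \infty$ hold along each pair that is visited infinitely often.

```python
import random

# Hypothetical MDP for illustration only (not from the slides).
GAMMA = 0.5
STATES = [0, 1]
ACTIONS = [0, 1]
P = {0: {0: [1.0, 0.0], 1: [0.2, 0.8]},   # P[s][a][s2] ~ p_T(s2 | s, a)
     1: {0: [0.7, 0.3], 1: [0.0, 1.0]}}
G = {0: {0: 0.0, 1: 1.0},                 # G[s][a] ~ reward for (s, a)
     1: {0: 0.5, 1: 2.0}}


def q_learning(num_steps=20_000, seed=0):
    """Tabular Q-learning with a uniform behaviour policy and 1/N step sizes."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
    visits = {(s, a): 0 for s in STATES for a in ACTIONS}
    s = 0
    for _ in range(num_steps):
        a = rng.choice(ACTIONS)                           # a_t ~ pi(. | s_t)
        r = G[s][a]                                       # r_t
        s_next = rng.choices(STATES, weights=P[s][a])[0]  # s_{t+1} ~ p_T(. | s_t, a_t)
        visits[(s, a)] += 1
        alpha = 1.0 / visits[(s, a)]
        td_target = r + GAMMA * max(q[(s_next, a2)] for a2 in ACTIONS)
        q[(s, a)] += alpha * (td_target - q[(s, a)])
        s = s_next
    return q


q_hat = q_learning()
policy = [max(ACTIONS, key=lambda a: q_hat[(s, a)]) for s in STATES]
print(q_hat, policy)
```

Only the entry for the visited pair $(s_t, a_t)$ changes at each step, which is why the step-size conditions are stated per pair with the indicator functions.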
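- 16. $v_{k+1} = B^*(v_k), \quad v_0 \in \mathbb{R}^n \;\Rightarrow\; v_k \to V^* \quad (k \to \infty)$
$q_{t+1} = (1 - \alpha_t) q_t + \alpha_t \hat\Upsilon^*(q_t; \{S_t, A_t, R_t, S_{t+1}\})$
$x_{t+1} = f_t(x_t)$
$x^*$
$f_t(x^*) = 0$
$\lim_{t \to \infty} \|x_t - x^*\| = 0$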
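- 17. $v_{k+1} = B^*(v_k), \quad v_0 \in \mathbb{R}^n \;\Rightarrow\; v_k \to V^* \quad (k \to \infty)$
$q_{t+1} = (1 - \alpha_t) q_t + \alpha_t \hat\Upsilon^*(q_t; \{S_t, A_t, R_t, S_{t+1}\})$
$x_{t+1} = f_t(x_t, \omega)$
$x^*$
$f_t(x^*, \omega) = 0, \quad \forall \omega \in \Omega$
$\lim_{t \to \infty} \mathbb{E}[\|x_t - x^*\|^2] = 0$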
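- 18. $q_{t+1} = (1 - \alpha_t) q_t + \alpha_t \hat\Upsilon^*(q_t; \{S_t, A_t, R_t, S_{t+1}\})$
$= (1 - \alpha_t) q_t + \alpha_t (\Upsilon^*(q_t) + X_t)$
$X_t := \hat\Upsilon^*(q_t; \{S_t, A_t, R_t, S_{t+1}\}) - \Upsilon^*(q_t)$
$\mathbb{E}[X_t] = 0, \quad \mathbb{E}[\|X_t\|^2] \le \text{const}$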
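The zero-mean and bounded-second-moment properties of the noise term $X_t$ can be checked numerically for a single state-action pair. The sketch below does this under hypothetical assumptions: a fixed table `Q`, a fixed next-state distribution `P_NEXT` for one pair $(s, a)$, a deterministic reward `REWARD`, and a sample size chosen only for the demo.

```python
import random

# Hypothetical fixed quantities for one pair (s, a); none of these come from the slides.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
P_NEXT = [0.2, 0.8]     # p_T(. | s, a) for the fixed pair
REWARD = 1.0            # reward for the fixed pair
Q = {(0, 0): 0.5, (0, 1): 1.0, (1, 0): 3.0, (1, 1): 2.0}  # arbitrary fixed q


def state_value(s):
    """max_{a'} q(s, a')."""
    return max(Q[(s, a)] for a in ACTIONS)


def exact_backup():
    """Optimality operator at the fixed pair: expectation over the next state."""
    return REWARD + GAMMA * sum(P_NEXT[s2] * state_value(s2) for s2 in STATES)


def sampled_backup(rng):
    """Single-transition estimate of the same quantity."""
    s_next = rng.choices(STATES, weights=P_NEXT)[0]
    return REWARD + GAMMA * state_value(s_next)


def noise_moments(n=100_000, seed=0):
    """Empirical mean and second moment of X = sampled backup - exact backup."""
    rng = random.Random(seed)
    exact = exact_backup()
    xs = [sampled_backup(rng) - exact for _ in range(n)]
    return sum(xs) / n, sum(x * x for x in xs) / n


mean_x, second_moment_x = noise_moments()
print(mean_x, second_moment_x)
```

With these numbers the empirical mean should sit close to zero and the second moment stays bounded, which matches the two conditions on $X_t$ stated on this slide.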