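Markov decision process:
$$M = \{S, A, p_T, p_0, g\}$$
Markov transition kernel and initial distribution:
$$\Pr\{S_{t+1} = s' \mid A_t = a, S_t = s, \dots\} = \Pr\{S_{t+1} = s' \mid A_t = a, S_t = s\} =: p_T(s' \mid s, a), \qquad \Pr(S_0 = s) =: p_0(s)$$
Markov policy $\pi \in \Pi^M$:
$$\Pr(A_t = a \mid S_t = s, \dots) = \Pr(A_t = a \mid S_t = s) =: \pi(a \mid s)$$
Value function:
$$V^\pi(s) := \mathbb{E}^\pi[C_0 \mid S_0 = s], \qquad C_t := \sum_{i=0}^{\infty} \gamma^i g(S_{t+i}, A_{t+i}), \qquad \gamma \in [0, 1)$$
Objective $f(\pi)$:
$$f(\pi) := \sum_{s \in S} p_0(s) V^\pi(s)$$
Goal: maximize $f(\pi)$ over $\pi \in \Pi^M$ for the MDP $M$.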
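Bellman optimality equation ($V^*$ is a fixed point of $B^*$):
$$V^* = \max_{\pi \in \Pi^M} V^\pi = \max_{a \in A} \Big( g(\cdot, a) + \gamma \sum_{s' \in S} p_T(s' \mid \cdot, a)\, V^*(s') \Big) = B^*(V^*)$$
From $V^*$ to an optimal policy $\pi^*$:
$$\pi^{*d} = \arg\max_{a \in A} \Big( g(\cdot, a) + \gamma \sum_{s' \in S} p_T(s' \mid \cdot, a)\, V^*(s') \Big)$$
$B^*$ is a contraction:
$$\| B^*(v) - B^*(u) \| \le \gamma \| v - u \|$$
Value iteration:
$$v_{k+1} = B^*(v_k), \quad v_0 \in \mathbb{R}^n \quad \Rightarrow \quad v_k \to V^* \quad (k \to \infty)$$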
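The iteration $v_{k+1} = B^*(v_k)$ can be sketched in a few lines of plain Python. This is only an illustrative sketch: the two-state, two-action MDP below (tables `g` and `p_T`, and the value `gamma = 0.9`) is a made-up example, not something taken from the slides, and the stopping tolerance is an arbitrary choice.

```python
def bellman_optimality(v, g, p_T, gamma):
    """Apply B* once: (B* v)(s) = max_a [g[s][a] + gamma * sum_s' p_T[s][a][s'] * v[s']]."""
    n_states, n_actions = len(g), len(g[0])
    return [
        max(
            g[s][a] + gamma * sum(p_T[s][a][s2] * v[s2] for s2 in range(n_states))
            for a in range(n_actions)
        )
        for s in range(n_states)
    ]


def value_iteration(g, p_T, gamma, tol=1e-10, max_iter=10_000):
    """Iterate v_{k+1} = B*(v_k) from v_0 = 0 until successive iterates are within tol."""
    v = [0.0] * len(g)
    for _ in range(max_iter):
        v_next = bellman_optimality(v, g, p_T, gamma)
        if max(abs(x - y) for x, y in zip(v_next, v)) < tol:
            return v_next
        v = v_next
    return v


# Hypothetical MDP (an assumption for illustration): g[s][a] is the reward,
# p_T[s][a][s'] the transition probability.
g = [[0.0, 1.0], [2.0, 0.0]]
p_T = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
gamma = 0.9
v_star = value_iteration(g, p_T, gamma)
```

For this toy MDP the fixed point can be checked by hand: staying in state 1 earns 2 per step, so $V^*(1) = 2 / (1 - 0.9) = 20$ and $V^*(0) = 1 + 0.9 \cdot 20 = 19$.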
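Bellman expectation operator $B^\pi$:
$$B^\pi V^\pi(s) := \sum_{a \in A} \pi(a \mid s) \Big[ g(s, a) + \gamma \sum_{s' \in S} p_T(s' \mid s, a)\, V^\pi(s') \Big] = \mathbb{E}^\pi\big[ g(S_t, A_t) + \gamma V^\pi(S_{t+1}) \mid S_t = s \big]$$
Bellman optimality operator $B^*$:
$$B^* V^*(s) := \max_{a \in A} \Big( g(s, a) + \gamma \sum_{s' \in S} p_T(s' \mid s, a)\, V^*(s') \Big) = \max_{\pi \in \Pi^M} \mathbb{E}^\pi\big[ g(S_t, A_t) + \gamma V^*(S_{t+1}) \mid S_t = s \big]$$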
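As a rough illustration of the operator $B^\pi$, the sketch below evaluates a fixed policy by applying $B^\pi$ repeatedly. The MDP tables `g`, `p_T`, the uniform policy `pi`, and the iteration count are assumptions chosen for the example, not values from the slides.

```python
def bellman_expectation(v, pi, g, p_T, gamma):
    """Apply B^pi once: average the one-step lookahead over a ~ pi(.|s)."""
    n_states, n_actions = len(g), len(g[0])
    return [
        sum(
            pi[s][a]
            * (g[s][a] + gamma * sum(p_T[s][a][s2] * v[s2] for s2 in range(n_states)))
            for a in range(n_actions)
        )
        for s in range(n_states)
    ]


def evaluate_policy(pi, g, p_T, gamma, n_iter=500):
    """Iterate v <- B^pi(v) from v = 0; n_iter is an arbitrary illustrative budget."""
    v = [0.0] * len(g)
    for _ in range(n_iter):
        v = bellman_expectation(v, pi, g, p_T, gamma)
    return v


# Hypothetical two-state, two-action MDP and a uniform random policy (assumptions).
g = [[0.0, 1.0], [2.0, 0.0]]
p_T = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
pi = [[0.5, 0.5], [0.5, 0.5]]
v_pi = evaluate_policy(pi, g, p_T, gamma=0.9)
```

Solving the two linear equations $V^\pi = B^\pi V^\pi$ by hand for this toy case gives $V^\pi(0) = 7.25$ and $V^\pi(1) = 7.75$, which the iteration should approach.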
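Discounted return $C_t$:
$$C_t := \sum_{i=0}^{\infty} \gamma^i g(S_{t+i}, A_{t+i}), \qquad \gamma \in [0, 1)$$
State-value function $V^\pi$:
$$V^\pi(s) := \mathbb{E}^\pi[C_0 \mid S_0 = s]$$
Optimal state-value function $V^*$:
$$V^*(s) := \max_{\pi \in \Pi^M} \mathbb{E}^\pi[C_0 \mid S_0 = s]$$
Action-value function $Q^\pi$:
$$Q^\pi(s, a) := \mathbb{E}^\pi[C_0 \mid S_0 = s, A_0 = a]$$
Optimal action-value function $Q^*$:
$$Q^*(s, a) := \max_{\pi \in \Pi^M} \mathbb{E}^\pi[C_0 \mid S_0 = s, A_0 = a]$$
Relations between $V$ and $Q$:
$$V^\pi(s) = \sum_{a \in A} Q^\pi(s, a)\, \pi(a \mid s), \qquad V^*(s) = \max_{a \in A} Q^*(s, a)$$
$$\pi^{*d} = \arg\max_{a \in A} Q^*(\cdot, a)$$
Bellman operator $\Upsilon^\pi$ for $Q^\pi$:
$$\Upsilon^\pi Q^\pi(s, a) := \mathbb{E}^\pi\big[ g(S_t, A_t) + \gamma Q^\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \big] = g(s, a) + \gamma \sum_{(s', a') \in S \times A} p_T(s' \mid s, a)\, \pi(a' \mid s')\, Q^\pi(s', a')$$
Bellman optimality operator $\Upsilon^*$ for $Q^*$:
$$\Upsilon^* Q^*(s, a) := \mathbb{E}^\pi\big[ g(S_t, A_t) + \gamma \max_{a' \in A} Q^*(S_{t+1}, a') \mid S_t = s, A_t = a \big] = g(s, a) + \gamma \sum_{s' \in S} p_T(s' \mid s, a) \max_{a' \in A} Q^*(s', a')$$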
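The operators applied to an arbitrary $q$:
$$\Upsilon^\pi(q) = g(\cdot) + \gamma \sum_{(s', a') \in S \times A} p_T(s' \mid \cdot)\, \pi(a' \mid s')\, q(s', a')$$
$$\Upsilon^*(q) = g(\cdot) + \gamma \sum_{s' \in S} p_T(s' \mid \cdot) \max_{a' \in A} q(s', a')$$
Order and norm for $q, q' : S \times A \to \mathbb{R}$:
$$q \le q' \;\Leftrightarrow\; q(s, a) \le q'(s, a), \quad \forall (s, a) \in S \times A$$
$$\| q - q' \| := \max_{(s, a) \in S \times A} | q(s, a) - q'(s, a) |$$
Monotonicity:
$$q \le q' \;\Rightarrow\; \Upsilon(q) \le \Upsilon(q')$$
Constant shift:
$$\Upsilon(q + c) = \Upsilon(q) + \gamma c, \quad \forall c \in \mathbb{R}$$
$\Upsilon$ is a contraction:
$$\| \Upsilon(q) - \Upsilon(q') \| \le \gamma \| q - q' \|$$
Q-value iteration:
$$q_{k+1} = \Upsilon^*(q_k), \quad q_0 \in \mathbb{R}^{n \times m} \quad \Rightarrow \quad q_k \to Q^* \quad (k \to \infty)$$
$$\pi^{*d} = \arg\max_{a \in A} Q^*(\cdot, a)$$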
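The Q-value iteration $q_{k+1} = \Upsilon^*(q_k)$ and the greedy policy read off from its limit can be sketched as follows. The MDP tables, `gamma = 0.9`, and the fixed iteration count are assumptions made for this example only.

```python
def q_optimality_operator(q, g, p_T, gamma):
    """Apply Upsilon* once: g(s,a) + gamma * sum_s' p_T(s'|s,a) * max_a' q(s',a')."""
    n_states, n_actions = len(g), len(g[0])
    return [
        [
            g[s][a] + gamma * sum(p_T[s][a][s2] * max(q[s2]) for s2 in range(n_states))
            for a in range(n_actions)
        ]
        for s in range(n_states)
    ]


def q_iteration(g, p_T, gamma, n_iter=500):
    """Iterate q_{k+1} = Upsilon*(q_k) from q_0 = 0 for a fixed, illustrative number of steps."""
    q = [[0.0] * len(g[0]) for _ in g]
    for _ in range(n_iter):
        q = q_optimality_operator(q, g, p_T, gamma)
    return q


def greedy_policy(q):
    """Deterministic policy: pick arg max_a q(s, a) in every state."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


# Hypothetical two-state, two-action MDP (an assumption, not from the slides).
g = [[0.0, 1.0], [2.0, 0.0]]
p_T = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
q_star = q_iteration(g, p_T, gamma=0.9)
policy = greedy_policy(q_star)
```

Working with $q$ instead of $v$ means the greedy policy needs no model lookahead: the arg max is taken directly over the table rows.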
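History of the MDP $M$ run under policy $\pi$ (random variables, then a realization):
$$H^\pi_t := \{ S_0, A_0, R_0, \dots, S_{t-1}, A_{t-1}, R_{t-1}, S_t \mid M(\pi) \}$$
$$h^\pi_t := \{ s_0, a_0, r_0, \dots, s_{t-1}, a_{t-1}, r_{t-1}, s_t \mid M(\pi) \}$$
Empirical operators built from a history $h^\pi_T$:
$$\hat\Upsilon^\pi(q; h^\pi_T)(s, a) := \begin{cases} \dfrac{\sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} \big( r_t + \gamma q(s_{t+1}, a_{t+1}) \big)}{\sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}}}, & \sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} > 0 \\[2ex] q(s, a), & \text{otherwise} \end{cases}$$
$$\hat\Upsilon^*(q; h^\pi_T)(s, a) := \begin{cases} \dfrac{\sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} \big( r_t + \gamma \max_{a' \in A} q(s_{t+1}, a') \big)}{\sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}}}, & \sum_{t=0}^{T-1} \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} > 0 \\[2ex] q(s, a), & \text{otherwise} \end{cases}$$
These are sample-average counterparts of the exact operators $\Upsilon^\pi$ and $\Upsilon^*$, which are defined through the expectations $\mathbb{E}^\pi[\,\cdot \mid S_t = s, A_t = a]$.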
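A minimal sketch of the empirical optimality operator $\hat\Upsilon^*$ is given below. It assumes the history has already been split into a list of `(s, a, r, s_next)` transitions, which is a representation chosen here for convenience; the function name and the sample data are likewise hypothetical.

```python
def empirical_optimality_operator(q, transitions, gamma):
    """Average r_t + gamma * max_a' q(s_{t+1}, a') over the visits to each (s, a).

    Pairs (s, a) that never occur in `transitions` keep their current value q(s, a).
    """
    n_states, n_actions = len(q), len(q[0])
    totals = [[0.0] * n_actions for _ in range(n_states)]
    counts = [[0] * n_actions for _ in range(n_states)]
    for s, a, r, s_next in transitions:
        totals[s][a] += r + gamma * max(q[s_next])
        counts[s][a] += 1
    return [
        [
            totals[s][a] / counts[s][a] if counts[s][a] > 0 else q[s][a]
            for a in range(n_actions)
        ]
        for s in range(n_states)
    ]


# Made-up history with three transitions; pairs (0, 0) and (1, 1) are never visited.
history = [(0, 1, 1.0, 1), (1, 0, 2.0, 1), (1, 0, 2.0, 1)]
q = [[0.0, 0.0], [5.0, 3.0]]
q_new = empirical_optimality_operator(q, history, gamma=0.9)
```

Here the visited entries become $1 + 0.9 \cdot 5 = 5.5$ and $2 + 0.9 \cdot 5 = 6.5$, while the unvisited entries are passed through unchanged.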
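Visitation condition (every state-action pair has positive long-run frequency):
$$\lim_{T \to \infty} \frac{1}{T} \sum_{t=1}^{T} \Pr(S_t = s, A_t = a \mid M(\pi)) > 0, \quad \forall (s, a) \in S \times A$$
Convergence of the empirical operators:
$$\hat\Upsilon^\pi(\cdot\,; h_T) \to \Upsilon^\pi, \qquad \hat\Upsilon^*(\cdot\,; h_T) \to \Upsilon^* \quad (T \to \infty)$$
Monotonicity:
$$q \le q' \;\Rightarrow\; \hat\Upsilon(q) \le \hat\Upsilon(q')$$
Constant shift:
$$\hat\Upsilon(q + c) = \hat\Upsilon(q) + \gamma c, \quad \forall c \in \mathbb{R}$$
$\hat\Upsilon$ is a contraction:
$$\| \hat\Upsilon(q) - \hat\Upsilon(q') \| \le \gamma \| q - q' \|$$
Iteration with the empirical operator:
$$q_{k+1} = \hat\Upsilon^*(q_k), \quad q_0 \in \mathbb{R}^{n \times m} \quad \Rightarrow \quad q_k \to \hat Q^* \quad (k \to \infty)$$
$$\hat\pi^{*d} = \arg\max_{a \in A} \hat Q^*(\cdot, a)$$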
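Batch iteration with an infinite history:
$$q_{k+1} = \hat\Upsilon^*(q_k; h^\pi_\infty), \quad q_0 \in \mathbb{R}^{n \times m} \quad \Rightarrow \quad q_k \to Q^* \quad (k \to \infty)$$
Online iteration using one transition per step:
$$q_{t+1} = (1 - \alpha_t)\, q_t + \alpha_t\, \hat\Upsilon^*(q_t; \{S_t, A_t, R_t, S_{t+1}\}), \qquad \mathbb{E}[\| q_0 \|] \le \text{const}$$
Step-size conditions:
$$\alpha_t \ge 0, \quad \forall t \in \mathbb{Z}_{\ge 0}$$
$$\sum_{t \in \mathbb{Z}_{\ge 0}} \alpha_t\, \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} = \infty, \quad \forall (s, a) \in S \times A$$
$$\sum_{t \in \mathbb{Z}_{\ge 0}} \alpha_t^2\, \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} < \infty, \quad \forall (s, a) \in S \times A$$
Convergence in mean square:
$$\lim_{t \to \infty} \mathbb{E}\big[ \| q_t - Q^* \|^2 \big] = 0$$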
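As a worked example of step sizes that meet these conditions, consider the common choice of one over the visit count. The symbol $N_t(s, a)$, the number of visits to $(s, a)$ up to and including time $t$, is notation introduced here for illustration, and the example assumes every pair is visited infinitely often.

$$\alpha_t = \frac{1}{N_t(s_t, a_t)} \;\Rightarrow\; \sum_{t \in \mathbb{Z}_{\ge 0}} \alpha_t\, \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} = \sum_{n=1}^{\infty} \frac{1}{n} = \infty, \qquad \sum_{t \in \mathbb{Z}_{\ge 0}} \alpha_t^2\, \mathbb{I}_{\{s = s_t\}} \mathbb{I}_{\{a = a_t\}} = \sum_{n=1}^{\infty} \frac{1}{n^2} = \frac{\pi^2}{6} < \infty$$

A constant step size $\alpha_t = \alpha > 0$ satisfies the first sum but not the second, since $\sum_n \alpha^2$ diverges.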
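Batch iteration:
$$q_{k+1} = \hat\Upsilon^*(q_k; h^\pi_\infty), \quad q_0 \in \mathbb{R}^{n \times m} \quad \Rightarrow \quad q_k \to Q^* \quad (k \to \infty)$$
Online iteration:
$$q_{t+1} = (1 - \alpha_t)\, q_t + \alpha_t\, \hat\Upsilon^*(q_t; \{S_t, A_t, R_t, S_{t+1}\}), \qquad \mathbb{E}[\| q_0 \|] \le \text{const}$$
Q-learning, one step:
$$a_t \sim \pi(\cdot \mid s_t)$$
$$r_t = g(s_t, a_t), \qquad s_{t+1} \sim p_T(\cdot \mid s_t, a_t)$$
$$\hat q_{t+1}(s_t, a_t) = \hat q_t(s_t, a_t) + \alpha_t \Big( r_t + \gamma \max_{a' \in A} \hat q_t(s_{t+1}, a') - \hat q_t(s_t, a_t) \Big)$$
Resulting policy:
$$\pi^{*d} = \arg\max_{a \in A} \hat q_\infty(\cdot, a)$$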
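The Q-learning step above can be sketched as a short tabular loop. Everything concrete here is an assumption for illustration: the two-state MDP, the uniform random behaviour policy, the step size $\alpha_t = 1 / N(s_t, a_t)$ (one over the visit count), the value `gamma = 0.5`, and the number of steps.

```python
import random


def q_learning(g, p_T, gamma, n_steps, seed=0):
    """Tabular Q-learning with a uniform behaviour policy and alpha = 1 / visit count."""
    rng = random.Random(seed)
    n_states, n_actions = len(g), len(g[0])
    q = [[0.0] * n_actions for _ in range(n_states)]
    visits = [[0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(n_steps):
        a = rng.randrange(n_actions)  # a_t ~ pi(.|s_t), here uniform
        r = g[s][a]                   # r_t = g(s_t, a_t)
        s_next = rng.choices(range(n_states), weights=p_T[s][a])[0]  # s_{t+1} ~ p_T
        visits[s][a] += 1
        alpha = 1.0 / visits[s][a]
        target = r + gamma * max(q[s_next])
        q[s][a] += alpha * (target - q[s][a])  # q <- q + alpha * (target - q)
        s = s_next
    return q


# Hypothetical MDP (assumption); gamma = 0.5 keeps the run short for this sketch.
g = [[0.0, 1.0], [2.0, 0.0]]
p_T = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
q_hat = q_learning(g, p_T, gamma=0.5, n_steps=100_000)
policy = [max(range(len(row)), key=row.__getitem__) for row in q_hat]
```

For this toy MDP with $\gamma = 0.5$, the exact values are $Q^*(0, \cdot) = (1.5, 3)$ and $Q^*(1, \cdot) = (4, 1.5)$, so the learned table should be close to these and the greedy policy should choose action 1 in state 0 and action 0 in state 1. With the $1/N$ step size, convergence is slow when $\gamma$ is close to 1, which is why a small $\gamma$ is used here.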
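Deterministic iteration (value iteration):
$$v_{k+1} = B^*(v_k), \quad v_0 \in \mathbb{R}^n \quad \Rightarrow \quad v_k \to V^* \quad (k \to \infty)$$
Stochastic iteration (Q-learning):
$$q_{t+1} = (1 - \alpha_t)\, q_t + \alpha_t\, \hat\Upsilon^*(q_t; \{S_t, A_t, R_t, S_{t+1}\})$$
General deterministic form:
$$x_{t+1} = f_t(x_t), \qquad f_t(x^*) = 0, \qquad \lim_{t \to \infty} \| x_t - x^* \| = 0$$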
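Deterministic iteration (value iteration):
$$v_{k+1} = B^*(v_k), \quad v_0 \in \mathbb{R}^n \quad \Rightarrow \quad v_k \to V^* \quad (k \to \infty)$$
Stochastic iteration (Q-learning):
$$q_{t+1} = (1 - \alpha_t)\, q_t + \alpha_t\, \hat\Upsilon^*(q_t; \{S_t, A_t, R_t, S_{t+1}\})$$
General stochastic form, with random outcome $\omega \in \Omega$:
$$x_{t+1} = f_t(x_t, \omega), \qquad f_t(x^*, \omega) = 0 \quad \forall \omega \in \Omega, \qquad \lim_{t \to \infty} \mathbb{E}\big[ \| x_t - x^* \|^2 \big] = 0$$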
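Q-learning as the exact operator plus noise:
$$q_{t+1} = (1 - \alpha_t)\, q_t + \alpha_t\, \hat\Upsilon^*(q_t; \{S_t, A_t, R_t, S_{t+1}\}) = (1 - \alpha_t)\, q_t + \alpha_t \big( \Upsilon^*(q_t) + X_t \big)$$
$$X_t := \hat\Upsilon^*(q_t; \{S_t, A_t, R_t, S_{t+1}\}) - \Upsilon^*(q_t)$$
$$\mathbb{E}[X_t] = 0, \qquad \mathbb{E}\big[ \| X_t \|^2 \big] \le \text{const}$$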
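One way to see the zero-mean property of the visited component, sketched under the assumptions that the reward is $R_t = g(S_t, A_t)$ and that $S_{t+1} \sim p_T(\cdot \mid S_t, A_t)$, is to condition on the current table $q_t = q$ and the visited pair $(s, a)$:

$$\mathbb{E}\big[ R_t + \gamma \max_{a' \in A} q(S_{t+1}, a') \mid S_t = s, A_t = a \big] = g(s, a) + \gamma \sum_{s' \in S} p_T(s' \mid s, a) \max_{a' \in A} q(s', a') = \Upsilon^*(q)(s, a)$$

so the $(s, a)$ entry of $X_t$ has conditional mean zero given $q_t = q$, $S_t = s$, $A_t = a$.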

Materials for Reinforcement Learning Study Group, Session 6
