[DL Reading Group] The act of remembering: A study in partially observable reinforcement learning

DEEP LEARNING JP
[DL Papers]
http://deeplearning.jp/
The act of remembering:
A study in partially observable reinforcement learning
Mitsuhiko Nakamoto, B4, Tsuruoka Lab.

Bibliographic information
Title :
The act of remembering: A study in partially observable reinforcement learning
Author :
Rodrigo Toro Icarte et al. (University of Toronto)
Conference:
NeurIPS2020 RWRL Workshop (Accepted)
ICLR2021 (Rejected)
Release date :
Oct. 2020
https://openreview.net/forum?id=uFkGzn9RId8

Preliminary : Reinforcement Learning (RL)
Next state $s_{t+1}$
Reward $r_t$
Action $a_t$
Policy $\pi(a \mid s)$
Maximize: $R = \sum_{t=0}^{\infty} \gamma^t r_t$
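The discounted return above can be computed for a finite reward sequence as in the following minimal sketch. The function name `discounted_return` and the example rewards are illustrative assumptions, not part of the slides.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * rewards[t] for a finite reward sequence."""
    total = 0.0
    # Accumulate backwards using G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: reward 1 arrives only at the third step.
example_return = discounted_return([0.0, 0.0, 1.0], gamma=0.9)
```

With a reward of 1 at step t = 2 only, the return equals gamma squared.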

Preliminary : Markov Decision Process (MDP)
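[Figure: MDP graphical model over $s_t$, $a_t$, $r_t$, $s_{t+1}$, $a_{t+1}$, $r_{t+1}$, ...; legend: observable / unobservable]
The next state depends only on the current state and action
RL often assumes an MDP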

Preliminary : Partially Observable MDP (POMDP)
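[Figure: POMDP graphical model over $s_t$, $o_t$, $a_t$, $r_t$, $s_{t+1}$, $o_{t+1}$, $a_{t+1}$, $r_{t+1}$, ...; legend: observable / unobservable]
To infer the current state, past state transitions and observations are important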

Why RL for POMDPs matters
In the real world, most environments are partially observable

Related work: approaches for dealing with POMDPs
• Input the last N observations (e.g., DQN [2])
• Use an RNN/LSTM (e.g., DRQN [3])
• Use an external memory plus a writing action (e.g., [4])
Issues:
• LSTM approaches seem to work well, but they are computationally expensive.
• Previous work using external memory produced poor results.
[Figure: DRQN [3]]

Proposed method
Add a memory buffer to the environment, together with agent actions that operate on it
→ Memory-Augmented Environment
Propose two new forms of memory buffer
→ $O_k$ and $OA_k$

Memory-Augmented Environment
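w : action that edits the memory
m : contents of the memory buffer
Augmented action: $\bar{a} = \langle a, w \rangle$
Augmented observation: $\bar{o} = \langle o', m' \rangle$
→ Just works like standard RL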
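A minimal sketch of the augmented interaction loop, assuming a simple `reset()` / `step(action)` environment interface. `ToyEnv`, `write_last_pair`, and `MemoryAugmentedEnv` are hypothetical names introduced for illustration. That the memory update sees the observation the agent acted on is also an assumption of this sketch, not something the slides specify.

```python
class ToyEnv:
    """Hypothetical stand-in environment: constant observation 0, three steps."""

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return 0, 0.0, self.t >= 3  # (next observation, reward, done)


def write_last_pair(memory, observation, action, write):
    """One-slot memory: store (observation, action) only when write is True."""
    return (observation, action) if write else memory


class MemoryAugmentedEnv:
    """Wraps an environment so the agent sees <o', m'> and acts with <a, w>."""

    def __init__(self, env, memory_update, initial_memory=None):
        self.env = env
        self.memory_update = memory_update
        self.initial_memory = initial_memory

    def reset(self):
        self.obs = self.env.reset()
        self.memory = self.initial_memory
        return (self.obs, self.memory)

    def step(self, augmented_action):
        action, write = augmented_action  # a-bar = <a, w>
        next_obs, reward, done = self.env.step(action)
        # Assumption: the memory update uses the observation acted on.
        self.memory = self.memory_update(self.memory, self.obs, action, write)
        self.obs = next_obs
        return (next_obs, self.memory), reward, done  # o-bar = <o', m'>


env = MemoryAugmentedEnv(ToyEnv(), write_last_pair)
first = env.reset()
second, _, _ = env.step((1, True))   # write: memory becomes (0, 1)
third, _, _ = env.step((2, False))   # no write: memory is unchanged
```

Because the wrapper exposes the same `reset` / `step` shape as the wrapped environment, any standard RL algorithm can be run on it unchanged.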

External Memory Module
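Prior work:
$K_k$ : memory that stores the k most recent observations (e.g., DQN [2])
$M = (O \cup \{\emptyset\})^k$, $W = \{\top\}$ ($\top$ : write to memory)
→ Cannot remember observations older than the last k
$B_k$ : k-bit binary memory (e.g., [4])
$M = \{0,1\}^k$, $W = \{0,1\}^k$
→ Too expressive to be learned well, and the action space grows exponentially in k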
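The two prior memory types can be sketched as follows. This is an illustrative reading of the definitions of M and W above. The function names are hypothetical, `None` stands in for the empty slot ∅, and treating a $B_k$ write as "overwrite the memory with the chosen bit string" is an assumption.

```python
from itertools import product


def k_memory_update(memory, observation):
    """K_k: always push the newest observation and drop the oldest one."""
    return memory[1:] + (observation,)


def b_memory_write_actions(k):
    """B_k: every k-bit string is a distinct write action, so |W| = 2**k."""
    return list(product((0, 1), repeat=k))


def b_memory_update(memory, write_action):
    """B_k: assumed here to overwrite the memory with the chosen bits."""
    return tuple(write_action)


# K_2 forgets anything older than the last two observations.
memory = (None, None)
for obs in "abc":
    memory = k_memory_update(memory, obs)

# The number of B_k write actions doubles with every extra bit.
num_b3_writes = len(b_memory_write_actions(3))
```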

External Memory Module
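Proposed method:
$O_k$ : stores k observations, and lets the agent choose whether to write a given observation to memory
$M = (O \cup \{\emptyset\})^k$, $W = \{\top, \bot\}$
$OA_k$ : extends $O_k$ so that it stores k observation-action pairs
$M = ((O \times A) \cup \{\emptyset\})^k$, $W = \{\top, \bot\}$
$\top$ : write to memory
$\bot$ : do not write to memory
Key idea: learning "when to push" is easier for the agent than learning "what to push"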
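A minimal sketch of the $O_k$ and $OA_k$ update rules, assuming a first-in-first-out buffer where a write drops the oldest entry. The function names, the `EMPTY` marker, and the example observations are hypothetical and used only for illustration.

```python
EMPTY = None  # stands in for the empty slot


def o_k_update(memory, observation, write):
    """O_k: push the observation only if the agent chose to write."""
    if not write:
        return memory
    return memory[1:] + (observation,)


def oa_k_update(memory, observation, action, write):
    """OA_k: same as O_k, but stores (observation, action) pairs."""
    if not write:
        return memory
    return memory[1:] + ((observation, action),)


# O_2 example: the agent skips an uninformative observation.
o_memory = (EMPTY, EMPTY)
o_memory = o_k_update(o_memory, "button", True)
o_memory = o_k_update(o_memory, "wall", False)
o_memory = o_k_update(o_memory, "cookie", True)

# OA_1 example: one slot holding the last written pair.
oa_memory = oa_k_update((EMPTY,), "button", "up", True)
```

The write action set has only two elements regardless of k, which is the contrast with $B_k$. The agent decides when to push, while what is pushed is fixed by the environment.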

Example: Gravity Domain
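• Actions: move (four directions: up, down, left, right)
• In the initial state, an invisible gravity acts on the agent, so it cannot move up properly even when it tries
• The blue button toggles gravity ON/OFF
• Gravity cannot be observed
• When the agent eats the cookie, it receives a reward of 1 and the episode ends
→ Because the gravity ON/OFF state cannot be observed, the task cannot be solved by treating it as an MDP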

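Example: $O_1$ Q-learning in Gravity Domain
[Figure: two panels. Left: standard tabular Q-learning, acting on the observation sequence $o_1, o_2, o_3, \ldots$ Right: $O_1$ tabular Q-learning, acting on observation and memory pairs and choosing a write action $\top$ or $\bot$ at each step]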

Speedup Comparison : training time for 200,000 steps

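Theoretical grounding of memory-augmented environments
- Even when the memory can encode the optimal policy (the blue path), the result is not necessarily an MDP
- Memory-augmented environments are not necessarily MDPs
- They can be defined as POMDPs (proved in Appendix A.1 of the paper)
The recall task
- There is only one observation (it never changes)
- Three actions: $a_1$, $a_2$, $a_3$
- The episode ends after three actions have been executed
- Reward +1 only if the actions are executed in the order $a_1 \to a_2 \to a_3$
- Reward 0 otherwise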
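The recall task described above can be sketched as a tiny environment. The class name `RecallTask`, the integer encoding of the actions (1, 2, 3), the constant observation value 0, and the helper `run` are assumptions made for illustration.

```python
class RecallTask:
    """Single constant observation; reward 1 only for the sequence 1, 2, 3."""

    TARGET = (1, 2, 3)

    def reset(self):
        self.history = []
        return 0  # the only observation

    def step(self, action):
        self.history.append(action)
        done = len(self.history) == 3
        success = done and tuple(self.history) == self.TARGET
        return 0, (1.0 if success else 0.0), done


def run(actions):
    """Play one episode with a fixed action sequence; return (total, done)."""
    env = RecallTask()
    env.reset()
    total, done = 0.0, False
    for a in actions:
        _, reward, done = env.step(a)
        total += reward
    return total, done


correct = run([1, 2, 3])
wrong = run([2, 1, 3])
```

Since the observation never changes, the agent can only succeed if something else (here, an external memory) tracks which actions it has already taken.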

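Policy evaluation in memory-augmented environments
$q_\pi(o, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid O_t = o, A_t = a\right]$
With a 1-step TD estimate...
$q_\pi(\emptyset, 2\top) = 0 + \gamma\, q_\pi(2, 3\top) = \gamma$
$q_\pi(\emptyset, 2\top) > q_\pi(\emptyset, 1\top) = \gamma^2$
(the wrong first action $a_2$ is valued above the correct first action $a_1$)
To prevent this...
1. Use n-step TD estimates (or Monte Carlo estimates)
2. Increase the memory buffer size k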
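The difference between the two estimates can be illustrated with a generic n-step target. This is a sketch, not the paper's code. The function name is hypothetical, gamma = 0.9 is an arbitrary choice, and the bootstrap value 1 for $q_\pi(2, 3\top)$ is read off the 1-step equation above.

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """r_0 + gamma*r_1 + ... + gamma**(n-1)*r_(n-1) + gamma**n * bootstrap."""
    target = bootstrap_value
    for r in reversed(rewards):
        target = r + gamma * target
    return target


gamma = 0.9

# 1-step TD for "start with a_2": bootstraps from q(2, 3T) = 1.
td_start_with_a2 = n_step_target([0.0], bootstrap_value=1.0, gamma=gamma)

# Monte Carlo (n covers the whole episode, no bootstrap):
# starting with a_2 can never earn the reward; a_1, a_2, a_3 earns 1 at the end.
mc_start_with_a2 = n_step_target([0.0, 0.0, 0.0], 0.0, gamma)
mc_start_with_a1 = n_step_target([0.0, 0.0, 1.0], 0.0, gamma)
```

The 1-step target credits the wrong first action with gamma, while the full-episode return gives it 0 and gives the correct first action gamma squared, so the ranking is restored.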

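Policy improvement in memory-augmented environments
- In an MDP, the policy improvement theorem guarantees that moving the current policy $\pi$ toward the greedy policy yields a better policy:
  $\tau(s) = \arg\max_{a \in A} q_\pi(s, a)$
- A memory-augmented environment is a POMDP, not an MDP
- In POMDPs, it has been proven that making the policy update step small enough leads to convergence to a locally optimal policy
- However, convergence to a globally optimal policy is not guaranteed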
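As an illustration of "greedy policy" versus "small update step", the sketch below picks the greedy action from a tabular q and then moves a stochastic policy only a fraction of the way toward it. The mixing rule is one simple way to take a small step, chosen here as an assumption. It is not claimed to be the update used in the paper, and all names and numbers are hypothetical.

```python
def greedy_action(q, state, actions):
    """tau(s) = argmax_a q(s, a) for a tabular q stored as a dict."""
    return max(actions, key=lambda a: q[(state, a)])


def small_step_update(policy_probs, greedy, step_size):
    """Mix the current action distribution with the greedy one.

    step_size = 1 jumps to the greedy policy; a small step_size
    moves only slightly toward it.
    """
    return {
        a: (1.0 - step_size) * p + (step_size if a == greedy else 0.0)
        for a, p in policy_probs.items()
    }


# Hypothetical q-values and a uniform starting policy for one state.
q = {("s", 1): 0.1, ("s", 2): 0.5}
best = greedy_action(q, "s", [1, 2])
new_policy = small_step_update({1: 0.5, 2: 0.5}, best, step_size=0.1)
```

Policy-gradient methods with small learning rates play a similar role in practice.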

The best approaches for learning effective policies are...
- Evaluate the policy with Monte Carlo estimates or n-step TD methods
- e.g., n-step actor-critic, A3C, PPO
- Update the policy with a sufficiently small step size, in line with the policy improvement theorem

Discussion & Limitation
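• The proposed $O_k$ and $OA_k$ outperform $K_k$, $B_k$, and LSTM in most environments
• They are faster to compute than LSTM
• However, because $O_k$ and $OA_k$ are limited by the buffer size k, the set of problems they can solve is limited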

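Summary
• Proposed a lightweight approach to RL in POMDPs
• Adds a "memory buffer" and "agent actions that can operate on the memory" to the environment
• Proposed two new memory buffer structures, $O_k$ and $OA_k$
• Achieved higher performance than prior methods such as LSTM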

References
1. The act of remembering: A study in partially observable reinforcement learning
2. Human-level control through deep reinforcement learning
3. Deep Recurrent Q-Learning for Partially Observable MDPs
4. Learning Policies with External Memory
