1. DEEP LEARNING JP
[DL Papers]
http://deeplearning.jp/
The act of remembering:
A study in partially observable reinforcement learning
Mitsuhiko Nakamoto, B4, Tsuruoka Lab.
2. Bibliographic Information
Title: The act of remembering: A study in partially observable reinforcement learning
Author: Rodrigo Toro Icarte et al. (University of Toronto)
Conference: NeurIPS 2020 RWRL Workshop (accepted); ICLR 2021 (rejected)
Release date: Oct. 2020
https://openreview.net/forum?id=uFkGzn9RId8
3. Preliminary : Reinforcement Learning (RL)
(Figure: agent–environment loop — the policy π(a|s) selects action a_t; the environment returns the next state s_{t+1} and reward r_t)
Maximize: R = \sum_{t=0}^{\infty} \gamma^t r_t
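The discounted return above can be computed for a finite trajectory by truncating the infinite sum at the episode length. A minimal sketch (the function name is ours, not from the paper):

```python
# Sketch: discounted return R = sum_t gamma^t * r_t over a finite trajectory.
def discounted_return(rewards, gamma=0.99):
    R = 0.0
    # Iterate backwards so each step computes R = r_t + gamma * R.
    for r in reversed(rewards):
        R = r + gamma * R
    return R

print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25 = 1.75
```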
4. Preliminary : Markov Decision Process (MDP)
(Figure: MDP transition diagram — states s_t, actions a_t, and rewards r_t; the state is fully observable)
The next state depends only on the current state and action.
RL algorithms typically assume an MDP.
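The Markov property stated on this slide can be written out explicitly:

```latex
% Markov property: the next state depends only on the current state and action,
% not on the earlier history of the trajectory.
P(s_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0)
  = P(s_{t+1} \mid s_t, a_t)
```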
5. Preliminary : Partially Observable MDP (POMDP)
(Figure: POMDP transition diagram — the agent receives observations o_t; the underlying states s_t are unobservable)
Past state transitions and observations are important for inferring the current state.
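Formally, in a POMDP the agent receives an observation instead of the state, so the policy must condition on the history (notation below is standard, not copied from the slide):

```latex
% Observation o_t is emitted from the hidden state s_t; the policy acts
% on the history h_t of past observations and actions.
o_t \sim \Omega(\,\cdot \mid s_t), \qquad
\pi(a_t \mid h_t), \quad h_t = (o_0, a_0, o_1, a_1, \dots, o_t)
```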
7. Related Work: Approaches for Dealing with POMDPs
• Input the last N observations (e.g., DQN [2])
• Use an RNN/LSTM (e.g., DRQN [3])
• Use an external memory with explicit write actions (e.g., [4])
Issues:
• LSTM approaches seem effective, but they are computationally expensive.
• Previous works using an external memory produced poor results.
(Figure: DRQN architecture [3])
10. External Memory Module
Prior work:
• K_k: a memory holding the last k observations (e.g., DQN [2]): M = (O ∪ {∅})^k, W = {⊤} (⊤: write to memory)
→ Cannot remember observations older than k steps.
• B_k: a k-bit binary memory (e.g., [4]): M = {0,1}^k, W = {0,1}^k
→ Too expressive to learn well, and the action space grows exponentially in k.
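The K_k memory can be sketched as a fixed-size sliding window that is written on every step, which is why observations older than k steps are lost (class and variable names below are ours, not the paper's):

```python
from collections import deque

# Sketch of the K_k memory: M = (O ∪ {∅})^k, W = {⊤}.
# The only "write action" is to push the newest observation every step.
class SlidingWindowMemory:
    def __init__(self, k):
        self.buffer = deque(maxlen=k)  # oldest entries fall off automatically

    def write(self, obs):
        self.buffer.append(obs)

    def read(self):
        return list(self.buffer)

mem = SlidingWindowMemory(k=3)
for o in ["o0", "o1", "o2", "o3"]:
    mem.write(o)
print(mem.read())  # "o0" is older than k=3 steps and has been forgotten
```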
11. External Memory Module
Proposed method:
• O_k: stores up to k observations, and lets the agent choose whether to write the current observation to memory: M = (O ∪ {∅})^k, W = {⊤, ⊥}
• OA_k: extends O_k to store k observation–action pairs: M = ((O × A) ∪ {∅})^k, W = {⊤, ⊥}
(⊤: write to memory, ⊥: do not write)
Key Idea: learning "when to push" is easier for the agent than learning "what to push".
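The O_k idea above can be sketched by extending the agent's action with a binary write flag, so only observations the agent deems important are stored (a minimal sketch with our own naming, not the paper's implementation):

```python
from collections import deque

# Sketch of the O_k memory: M = (O ∪ {∅})^k, W = {⊤, ⊥}.
# The agent decides "when to push": write=True stores the observation,
# write=False leaves the memory unchanged.
class SelectiveMemory:
    def __init__(self, k):
        self.buffer = deque(maxlen=k)

    def step(self, obs, write):
        if write:                 # ⊤: store this observation
            self.buffer.append(obs)
        # ⊥: memory is untouched
        return list(self.buffer)

mem = SelectiveMemory(k=2)
mem.step("door_is_red", write=True)       # remember the important cue
mem.step("empty_corridor", write=False)   # skip an uninformative frame
print(mem.step("empty_corridor", write=False))  # ['door_is_red']
```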
24. References
1. Toro Icarte et al., "The act of remembering: A study in partially observable reinforcement learning," 2020.
2. Mnih et al., "Human-level control through deep reinforcement learning," Nature, 2015.
3. Hausknecht and Stone, "Deep Recurrent Q-Learning for Partially Observable MDPs," 2015.
4. Peshkin et al., "Learning Policies with External Memory," 1999.