DEEP LEARNING JP
[DL Papers]
http://deeplearning.jp/
The act of remembering:
A study in partially observable reinforcement learning
Mitsuhiko Nakamoto, B4, Tsuruoka Lab.
Bibliographic information
Title :
The act of remembering: A study in partially observable reinforcement learning
Author :
Rodrigo Toro Icarte et al. (University of Toronto)
Conference:
NeurIPS2020 RWRL Workshop (Accepted)
ICLR2021 (Rejected)
Release date :
Oct. 2020
https://openreview.net/forum?id=uFkGzn9RId8
Preliminary : Reinforcement Learning (RL)
Next state $s_{t+1}$
Reward $r_t$
Action $a_t$
Policy $\pi(a \mid s)$
Maximize: $R = \sum_{t=0}^{\infty} \gamma^t r_t$
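As a small illustration of the objective above, the discounted return of a finite episode can be computed directly from its rewards. This is only a sketch: the function name `discounted_return` and the example reward sequence are illustrative assumptions, not taken from the paper.

```python
def discounted_return(rewards, gamma):
    """Return sum over t of gamma**t * rewards[t] for a finite reward list."""
    total = 0.0
    # Walk backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: a reward of 1 arrives only on the third step.
example = discounted_return([0.0, 0.0, 1.0], gamma=0.9)
```

With a reward that arrives late, the return shrinks by one factor of gamma per step of delay, which is why the discount factor matters when comparing action sequences.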
Preliminary : Markov Decision Process (MDP)
[Figure: graphical model of an MDP over $a_t, s_t, r_t$ and $a_{t+1}, s_{t+1}, r_{t+1}$; legend: observable / unobservable]
The next state depends only on the current state and action.
Reinforcement learning often assumes an MDP.
Preliminary : Partially Observable MDP (POMDP)
[Figure: graphical model of a POMDP over $a_t, s_t, r_t, o_t$ and $a_{t+1}, s_{t+1}, r_{t+1}, o_{t+1}$; legend: observable / unobservable]
Past state transitions and observations are important for inferring the current state.
Why RL for POMDPs matters
In the real world, most environments are partially observable.
Related work: Approaches to deal with POMDPs
• Input the last N observations (e.g. DQN [2])
• Use an RNN/LSTM (e.g. DRQN [3])
• Use an external memory plus a writing action (e.g. [4])
Open issues:
• LSTM approaches seem to work well, but they are computationally expensive.
• Previous work using external memory produced poor results.
[Figure: DRQN [3]]
Proposed method
Add a memory buffer to the environment, together with agent actions that can manipulate it
→ Memory-Augmented Environment
Propose two new forms of memory buffer
→ $O_k$ and $OA_k$
Memory-Augmented Environment
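$w$: an action that edits the memory
$m$: the contents of the memory buffer
$\bar{a} = \langle a, w \rangle$
$\bar{o} = \langle o', m' \rangle$
From the agent's side, this works just like standard RL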
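The construction above can be sketched as a thin wrapper around an existing environment. This is a minimal sketch under stated assumptions, not the authors' implementation: the `reset()`/`step()` interface, the class names `MemoryAugmentedEnv` and `CountdownEnv`, and the signature of `update_memory` (which receives the observation at which the action was taken) are all hypothetical.

```python
class CountdownEnv:
    """Hypothetical toy environment: constant observation, ends after n steps."""

    def __init__(self, n):
        self.n = n

    def reset(self):
        self.t = 0
        return "o"

    def step(self, action):
        self.t += 1
        return "o", 0.0, self.t >= self.n


class MemoryAugmentedEnv:
    """Exposes (observation, memory) pairs and accepts (action, write) pairs."""

    def __init__(self, env, update_memory, initial_memory):
        self.env = env                      # assumed: reset() -> o, step(a) -> (o', r, done)
        self.update_memory = update_memory  # assumed: (m, o, a, w) -> m'
        self.initial_memory = initial_memory

    def reset(self):
        self.obs = self.env.reset()
        self.memory = self.initial_memory
        return (self.obs, self.memory)

    def step(self, augmented_action):
        a, w = augmented_action
        next_obs, reward, done = self.env.step(a)
        # The memory is edited using the observation seen before acting.
        self.memory = self.update_memory(self.memory, self.obs, a, w)
        self.obs = next_obs
        return (next_obs, self.memory), reward, done
```

Because the wrapper has the same `reset`/`step` shape as the wrapped environment, an unmodified learning algorithm can treat the pair of observation and memory as its observation.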
External Memory Module
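Prior work:
$K_k$: a memory that stores the last $k$ observations (e.g. DQN [2])
$M = (O \cup \{\emptyset\})^k$, $W = \{\top\}$ ($\top$: write to the memory)
→ Cannot remember observations older than the last $k$
$B_k$: a $k$-bit binary memory (e.g. [4])
$M = \{0,1\}^k$, $W = \{0,1\}^k$
→ Too expressive to learn well, and the action space grows exponentially in $k$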
External Memory Module
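Proposed method:
$O_k$: stores $k$ observations, and lets the agent choose whether to write the current observation to the memory
$M = (O \cup \{\emptyset\})^k$, $W = \{\top, \bot\}$
$OA_k$: extends $O_k$ to store $k$ observation-action pairs
$M = ((O \times A) \cup \{\emptyset\})^k$, $W = \{\top, \bot\}$
$\top$: write to the memory
$\bot$: do not write to the memory
Key idea: learning "when to push" is easier for the agent than learning "what to push"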
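The buffer types defined on these two slides can be sketched as update functions over fixed-length tuples. This is an illustrative sketch only: the function names, the use of `None` for the empty slot, and the choice to push at the front and drop the oldest entry are assumptions rather than details confirmed by the slides.

```python
def push(memory, item):
    """Insert item at the front of a fixed-size buffer, dropping the oldest entry."""
    return (item,) + memory[:-1]


def update_K(memory, obs, action, write=True):
    # K_k: the only write action is "write", so every observation is pushed.
    return push(memory, obs)


def update_O(memory, obs, action, write):
    # O_k: the observation is pushed only when the agent chooses to write.
    return push(memory, obs) if write else memory


def update_OA(memory, obs, action, write):
    # OA_k: the (observation, action) pair is pushed only when the agent chooses to write.
    return push(memory, (obs, action)) if write else memory


def num_write_actions(kind, k):
    # Size of W for each buffer type, following the set definitions on the slides.
    return {"B": 2 ** k, "K": 1, "O": 2, "OA": 2}[kind]
```

The last helper makes the action-space argument concrete: the binary memory needs `2 ** k` write actions, while the proposed buffers always need two, whatever the value of `k`.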
Example: Gravity Domain
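• Actions: move (four directions: up, down, left, right)
• Initially an invisible gravity is active, so the agent cannot move up properly even when it tries
• The blue button toggles gravity ON/OFF
• Gravity itself cannot be observed
• When the agent eats the cookie it receives a reward of 1 and the episode ends
→ Because the ON/OFF state of gravity is unobservable, the task cannot be solved as an MDP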
Example: O1 Q-learning in Gravity Domain
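[Figure: two panels. "Standard tabular Q-learning": the agent acts on observations $o_1, o_2, o_3, \dots$ alone. "$O_1$ tabular Q-learning": the agent acts on observation-memory pairs and chooses $\top$ or $\bot$ at each step.]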
Experiment : Gravity Domain
Experiment : Toward Deep RL
Results using PPO
Speedup Comparison : training time for 200,000 steps
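Theoretical grounding of Memory-Augmented Environments
- Even when the memory can encode the optimal policy (the blue path), the result is not necessarily an MDP
- Memory-Augmented Environments are not necessarily MDPs
- They can be defined as POMDPs (proved in Appendix A.1 of the paper)
The recall task
- There is only one observation (it never changes)
- Three actions: $a_1, a_2, a_3$
- The episode ends after three actions have been executed
- Reward +1 only when the actions are executed in the order $a_1 \to a_2 \to a_3$
- Reward 0 otherwise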
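The recall task described in this list is small enough to write out in full. The following is a minimal sketch, assuming integer actions 1, 2, 3 as stand-ins for a1, a2, a3 and a `reset()`/`step()` interface; the class and helper names are hypothetical.

```python
class RecallTask:
    OBSERVATION = "o"    # the single, never-changing observation
    ACTIONS = (1, 2, 3)  # stand-ins for a1, a2, a3

    def reset(self):
        self.history = []
        return self.OBSERVATION

    def step(self, action):
        self.history.append(action)
        done = len(self.history) == 3
        # Reward +1 only on the final step, and only for the exact order 1, 2, 3.
        reward = 1.0 if done and self.history == [1, 2, 3] else 0.0
        return self.OBSERVATION, reward, done


def run_episode(actions):
    """Play the given actions and return the list of rewards received."""
    env = RecallTask()
    env.reset()
    rewards = []
    for a in actions:
        _, r, done = env.step(a)
        rewards.append(r)
        if done:
            break
    return rewards
```

Since the observation never changes, the agent can only succeed if something else, such as an external memory, tells it how far along the sequence it is.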
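Policy evaluation in a Memory-Augmented Environment
$q_\pi(o, a) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k r_{t+k} \mid O_t = o, A_t = a \right]$
With 1-step TD estimates…
$q_\pi(\emptyset, 2\top) = 0 + \gamma\, q_\pi(2, 3\top) = \gamma$
$q_\pi(\emptyset, 2\top) > q_\pi(\emptyset, 1\top) = \gamma^2$
so the wrong first action $a_2$ looks better than the correct first action $a_1$.
To prevent this…
1. Use n-step TD estimates (or Monte-Carlo estimates)
2. Increase the memory buffer size $k$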
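The gap between the 1-step TD estimate and the Monte-Carlo estimate on the recall task can be reproduced with a few lines of arithmetic. This is a sketch under explicit assumptions: the memory holds only the last action (an $OA_1$-style buffer with a single observation), the agent always writes, the evaluated policy plays 1, 2, 3, and the discount factor 0.9 is arbitrary. All names are hypothetical.

```python
GAMMA = 0.9  # illustrative discount factor


def rewards_for(actions):
    # Recall task: +1 on the third step only for the sequence 1, 2, 3.
    return [0.0, 0.0, 1.0 if actions == [1, 2, 3] else 0.0]


# Policy under evaluation: maps the memory (last action; None = empty) to an
# action and always writes. The entry for memory 3 is arbitrary.
POLICY = {None: 1, 1: 2, 2: 3, 3: 1}


def rollout(first_action):
    """Take first_action, then follow POLICY until three actions are done."""
    actions = [first_action]
    while len(actions) < 3:
        actions.append(POLICY[actions[-1]])
    return actions


def monte_carlo_return(actions):
    """Discounted return of a full episode, with no bootstrapping."""
    return sum(GAMMA ** t * r for t, r in enumerate(rewards_for(actions)))


# On-policy values q[(memory, action)] along the trajectory 1, 2, 3.
on_policy = rollout(POLICY[None])
q, g, memories = {}, 0.0, [None] + on_policy[:-1]
for m, a, r in reversed(list(zip(memories, on_policy, rewards_for(on_policy)))):
    g = r + GAMMA * g
    q[(m, a)] = g

# 1-step TD target for starting with action 2: bootstrap from q(2, 3).
td_estimate = 0.0 + GAMMA * q[(2, POLICY[2])]
# Monte-Carlo estimate for the same choice: the episode 2, 3, 1 earns nothing.
mc_estimate = monte_carlo_return(rollout(2))
```

Under these assumptions the bootstrapped target for starting with action 2 exceeds the value of the correct first action, while the Monte-Carlo estimate for the same choice is zero, because the memory entry "2" does not reveal whether action 1 came before it.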
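Policy improvement in a Memory-Augmented Environment
- In an MDP, the policy improvement theorem guarantees that moving the current policy $\pi$ toward the greedy policy $\tau(s) = \arg\max_{a \in A} q_\pi(s, a)$ yields a better policy
- Memory-Augmented Environments are POMDPs, not MDPs
- In a POMDP, it has been proved that the policy converges to a locally optimal policy if the policy updates are small enough
- However, convergence to a globally optimal policy is not guaranteed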
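One simple way to picture a small update toward the greedy policy is to mix the current action probabilities with the greedy choice. This is an illustrative sketch only, not the update rule used in the paper; the function names and the mixing scheme are assumptions.

```python
def greedy_action(q_values):
    """Return the action with the largest estimated value (first one on ties)."""
    return max(q_values, key=q_values.get)


def small_step_update(policy_probs, q_values, step_size):
    """Move action probabilities a fraction step_size toward the greedy policy."""
    best = greedy_action(q_values)
    return {
        a: (1.0 - step_size) * p + (step_size if a == best else 0.0)
        for a, p in policy_probs.items()
    }


# Hypothetical numbers: a uniform policy nudged toward the higher-valued action.
updated = small_step_update({"a1": 0.5, "a2": 0.5}, {"a1": 0.2, "a2": 0.7}, 0.1)
```

A `step_size` of 1 would jump straight to the greedy policy, which is the case the policy improvement theorem covers for MDPs; smaller values correspond to the cautious updates discussed for POMDPs.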
The best approaches for learning effective policies are…
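- Evaluate the policy with Monte-Carlo estimates or n-step TD methods
- For example, n-step actor-critic, A3C, PPO
- Update the policy with a sufficiently small step size, in line with the policy improvement theorem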
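The difference between 1-step and n-step evaluation comes down to how many real rewards enter the target before bootstrapping. The sketch below computes a generic n-step TD target; the function name, argument layout, and the deliberately poor value estimates in the example are assumptions for illustration.

```python
def n_step_target(rewards, values, t, n, gamma):
    """n-step TD target at step t.

    rewards[i] is the reward received after step i, and values[i] is the
    current value estimate at step i. Beyond the end of the episode the
    value is taken to be zero, so no bootstrap term is added.
    """
    T = len(rewards)
    end = min(t + n, T)
    target = sum(gamma ** (i - t) * rewards[i] for i in range(t, end))
    if end < T:
        # The episode continues: bootstrap from the estimate n steps ahead.
        target += gamma ** (end - t) * values[end]
    return target


# Hypothetical three-step episode with inaccurate value estimates of 5.0.
rewards, values = [0.0, 0.0, 1.0], [5.0, 5.0, 5.0]
one_step = n_step_target(rewards, values, t=0, n=1, gamma=0.9)
full = n_step_target(rewards, values, t=0, n=3, gamma=0.9)
```

When `n` reaches the episode length, the target no longer depends on `values` at all and coincides with the Monte-Carlo return, which is the property the bullets above rely on.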
Discussion & Limitation
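• The proposed $O_k$ and $OA_k$ outperform $K_k$, $B_k$, and LSTM in most environments
• They are computationally faster than LSTM
• However, $O_k$ and $OA_k$ are limited by the buffer size $k$, which restricts the problems they can solve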
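Summary
• Proposes a lightweight approach to RL in POMDPs
• Adds a "memory buffer" to the environment, plus agent actions that can manipulate the memory
• Proposes two new memory buffer structures, "$O_k$" and "$OA_k$"
• Achieves higher performance than prior methods such as LSTM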
References
1. The act of remembering: A study in partially observable reinforcement learning
2. Human-level control through deep reinforcement learning
3. Deep Recurrent Q-Learning for Partially Observable MDPs
4. Learning Policies with External Memory
