Evolved Policy Gradients
Rein Houthooft, Richard Y. Chen, Phillip Isola, Bradly C. Stadie,
Filip Wolski, Jonathan Ho, Pieter Abbeel
AI Lab 阿部拳之
2018/12/07
Humans vs. RL
• Humans
  • When learning a new task, they have an intuitive sense of how to go about learning it
    • Example: when practicing the violin, they might first try playing fast or playing forcefully
    • Just by listening to the sound they produce, they can intuitively tell whether they are playing the wrong way
  • They can draw on an internal reward function built from prior experience of listening to and playing other instruments
• RL
  • Knows nothing about what to try or what outcomes are desirable
  • Instead, it learns from an external reward function
  • Because it learns without prior heuristics, it takes much longer to learn than humans do
Objective of reinforcement learning
• Learn the parameters θ of the policy π_θ so as to increase reward:
  θ* = arg max_θ E_{τ~ℳ, π_θ}[R_τ]
• ℳ: an MDP environment
• τ = (s_0, a_0, r_0, …, s_H, a_H, r_H) ~ ℳ: the trajectory of experience collected from time t = 0 to t = H
• R_τ = Σ_{t=0}^{H} γ^t r_t: the discounted sum of rewards from t = 0 to t = H (a small code sketch follows below)
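To make R_τ concrete, here is a minimal sketch of computing the discounted return from a list of rewards; NumPy and the example values are illustrative assumptions, not values from the paper.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """R_tau = sum_{t=0}^{H} gamma^t * r_t for a single trajectory."""
    rewards = np.asarray(rewards, dtype=np.float64)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.dot(discounts, rewards))

# Example: rewards r_0..r_2 = [1.0, 0.0, 2.0] with gamma = 0.9
# gives 1.0 + 0.9*0.0 + 0.81*2.0 = 2.62
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))
```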
Objective of reinforcement learning
• Learn the parameters θ of the policy π_θ so as to increase reward:
  θ* = arg max_θ E_{τ~ℳ, π_θ}[R_τ]   ← an objective defined by the external reward function
• External reward function
  • Cannot exploit experience from past tasks
    • When learning a similar task, humans can reuse experience from past tasks, so they learn quickly
    • RL has to relearn from scratch every time the task changes
• What this paper sets out to do
  • Learn the reward function (loss function) itself, so that learning on a new task becomes more efficient → meta-learning
Relation to Existing Literature
• Meta learning
  • MAML
    • Updates the policy to maximize the expected cumulative reward over a distribution of tasks
  • RL2
    • Uses an RNN as the policy model, so past experience is also taken into account
• Learning reward functions
  • Inverse Reinforcement Learning
    • Learns an internal reward function along the gradient direction of the external reward function
  • Curiosity learning
    • Gives larger rewards for exploratory behavior
    • Random Network Distillation
Methodology
Approach
• Build a learning method that uses experience from past tasks to learn new tasks quickly
• Rather than using past experience explicitly, use it implicitly by folding that experience into the loss function
Metalearning Objective
• Objective for the policy π_θ
  • Learn θ to minimize the loss function L_φ (which replaces R_τ):
    θ* = arg min_θ E_{τ~ℳ, π_θ}[L_φ(π_θ, τ)]
• Objective for the loss function L_φ
  • Learn the loss so that the policy π_θ* trained with it obtains a higher sum of rewards:
    φ* = arg max_φ E_{ℳ~p(ℳ)} E_{τ~ℳ, π_θ*}[R_τ]
  • The expectation is taken over the distribution of tasks → the loss generalizes across a variety of tasks
Algorithm
• Learning flow
  1. Sample tasks ℳ from the task distribution p(ℳ)
  2. For each task ℳ, act according to the policy π_θ and update it along the gradient of the loss function (sketched in code below):
     θ* = arg min_θ E_{τ~ℳ, π_θ}[L_φ(π_θ, τ)]
  3. Using the cumulative rewards of the policies π_θ* trained on each task, update the loss function:
     φ* = arg max_φ E_{ℳ~p(ℳ)} E_{τ~ℳ, π_θ*}[R_τ]
  (Figure annotation: the loss function is used to produce trained policies; the trained policies are used to update the loss function.)
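As a rough illustration of step 2, the sketch below trains a policy by gradient descent on a learned loss L_φ. PyTorch is assumed, and `policy`, `learned_loss`, and `rollout` are hypothetical stand-ins for the paper's actual components, not its implementation.

```python
import torch

def inner_loop(policy, learned_loss, rollout, updates=100, lr=1e-2):
    """Inner loop: update the policy by gradient descent on the learned loss
    L_phi instead of an external reward. All interfaces are hypothetical:
      policy       -- torch.nn.Module mapping states to action distributions
      learned_loss -- callable L_phi(policy, traj) -> scalar torch tensor
      rollout      -- callable running the policy in a task and returning a
                      trajectory of (s_t, a_t, d_t, pi_theta(.|s_t)) tuples
    """
    opt = torch.optim.SGD(policy.parameters(), lr=lr)
    for _ in range(updates):
        traj = rollout(policy)             # act with pi_theta in the task M
        loss = learned_loss(policy, traj)  # L_phi(pi_theta, tau), differentiable in theta
        opt.zero_grad()
        loss.backward()
        opt.step()
    return policy
```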
(Slide background: screenshot of the paper's first page. Figure 1 caption: "High-level overview of our approach. The method consists of an inner and outer optimization loop. The inner loop (boxed) optimizes the agent's policy against a loss provided by the outer loop, using gradient descent. The outer loop optimizes the parameters of the loss function such that the optimized inner-loop policy achieves high performance on an arbitrary task, such as solving a control task of interest. The evolved loss L can be understood as a parametrization of policy gradients' surrogate loss, lending the name 'evolved policy gradients'.")
Algorithm
• Learning flow (same as the previous slide), with one key observation:
  • The inner-loop objective θ* = arg min_θ E_{τ~ℳ, π_θ}[L_φ(π_θ, τ)] is differentiable with respect to θ
  • The outer-loop objective φ* = arg max_φ E_{ℳ~p(ℳ)} E_{τ~ℳ, π_θ*}[R_τ] is not differentiable with respect to φ → how should we update it...?
Evolution Strategies
• Update φ using Evolution Strategies (ES):
  φ* = arg max_φ E_{ℳ~p(ℳ)} E_{τ~ℳ, π_θ*}[R_τ]
  1. Sample V perturbation vectors ε from the normal distribution N(0, I)
  2. Compute the objective F(φ + σε), i.e. the cumulative reward of a policy trained with L_{φ+σε}
  3. Update φ (a code sketch of this update follows below):
     φ ← φ + δ_out · (1 / (Vσ)) Σ_{v=1}^{V} F(φ + σε_v) ε_v
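A minimal NumPy sketch of this update; `evaluate_F` is a hypothetical callable that runs the inner loop with the perturbed loss L_{φ+σε} and returns the trained policy's cumulative reward, and the default hyperparameter values are illustrative assumptions.

```python
import numpy as np

def es_update(phi, evaluate_F, V=64, sigma=0.1, delta_out=1e-2, rng=None):
    """One Evolution Strategies step on the loss parameters phi (a 1-D array):
        phi <- phi + delta_out * (1 / (V * sigma)) * sum_v F(phi + sigma*eps_v) * eps_v
    evaluate_F(phi') is assumed to return the cumulative reward of a policy
    trained with the perturbed loss L_{phi'}."""
    rng = rng or np.random.default_rng()
    eps = rng.standard_normal((V, phi.size))                  # eps_v ~ N(0, I)
    F = np.array([evaluate_F(phi + sigma * e) for e in eps])  # F(phi + sigma*eps_v)
    grad_est = (F[:, None] * eps).sum(axis=0) / (V * sigma)   # ES gradient estimate
    return phi + delta_out * grad_est
```

Because the V evaluations are independent, they can be run in parallel, which matters given how expensive each inner-loop evaluation is.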
Evolution Strategies
1. Sample V perturbation vectors ε from the normal distribution N(0, I)
2. Compute the objective F(φ + σε)
3. Update φ:
   φ ← φ + δ_out · (1 / (Vσ)) Σ_{v=1}^{V} F(φ + σε_v) ε_v
Algorithm
• Learning flow
  1. Sample V perturbation vectors ε from a normal distribution
  2. Train a policy with each of the V perturbed loss functions L_{φ+σε_v}:
     I.  Sample tasks ℳ from the task distribution p(ℳ)
     II. For each task ℳ, act according to the policy π_θ and update it along the gradient of L_{φ+σε_v}:
         θ* = arg min_θ E_{τ~ℳ, π_θ}[L_{φ+σε_v}(π_θ, τ)]
  3. Using the cumulative rewards of the policies π_θ* trained with each loss function, update the loss function:
     φ ← φ + δ_out · (1 / (Vσ)) Σ_{v=1}^{V} F(φ + σε_v) ε_v
  (Figure annotation: the loss functions are used to produce trained policies; the trained policies are used to update the loss function. A combined code sketch of this loop follows below.)
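Putting the pieces together, a compact sketch of the full outer loop. All helper interfaces (`sample_task`, `make_policy`, `make_rollout`, `learned_loss_for`, `cumulative_reward`, and the `inner_loop` from the earlier sketch) are hypothetical stand-ins, and the paper's distributed-worker details are omitted.

```python
import numpy as np

def epg_outer_loop(phi, sample_task, make_policy, make_rollout, learned_loss_for,
                   cumulative_reward, inner_loop, outer_iters=100, V=64,
                   sigma=0.1, delta_out=1e-2):
    """Sketch of the EPG outer loop (hypothetical helper interfaces throughout):
    for each perturbed loss L_{phi + sigma*eps_v}, sample a task M ~ p(M), train
    a policy with the inner loop, record its cumulative reward F(phi + sigma*eps_v),
    then apply the ES update to phi."""
    rng = np.random.default_rng()
    for _ in range(outer_iters):
        eps = rng.standard_normal((V, phi.size))                # eps_v ~ N(0, I)
        F = np.zeros(V)
        for v in range(V):
            task = sample_task()                                # M ~ p(M)
            loss_v = learned_loss_for(phi + sigma * eps[v])     # L_{phi + sigma*eps_v}
            policy = inner_loop(make_policy(task), loss_v, make_rollout(task))
            F[v] = cumulative_reward(task, policy)              # F(phi + sigma*eps_v)
        phi = phi + delta_out * (F[:, None] * eps).sum(axis=0) / (V * sigma)
    return phi
```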
Architecture
• Loss function
  • Outputs a loss for the policy's experience at a given timestep t: (s_t, a_t, d_t, π_θ(·|s_t))
  • The loss is computed taking into account the past experience {(s_i, a_i, d_i, π_θ(·|s_i))}_{i=t−N}^{t}
• Input to the loss function's dense layer:
  (s_t, a_t, d_t, mem, f_context, π_θ(·|s_t))
• f_context
  • Computed by convolving over the past experience {(s_i, a_i, d_i, mem, π_θ(·|s_i))}_{i=t−N}^{t}
  • The output of a temporal convolutional layer
Architecture
(Excerpt from the paper shown on the slide:) "The agent is parametrized using an MLP policy with observation space S and action space A. The loss has a memory unit to assist learning in the inner loop. This memory unit is a single-layer neural network to which an invariable input …"
• temporal convolutional layer
  • Input: the past experience {(s_i, a_i, d_i, mem, π_θ(·|s_i))}_{i=t−N}^{t}
  • Output: f_context
• dense layer
  • Input: (s_t, a_t, d_t, mem, f_context, π_θ(·|s_t))
  • Output: loss (a code sketch of this architecture follows below)
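A rough PyTorch sketch of the two-stage loss architecture described above: a temporal convolution over a buffer of recent experience produces f_context, which is concatenated with the current-timestep features and passed through dense layers to produce a scalar loss. Layer widths, kernel sizes, and the exact feature layout are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class EvolvedLossSketch(nn.Module):
    """Sketch of L_phi: temporal conv over the experience buffer -> f_context,
    then a dense head over (s_t, a_t, d_t, mem, f_context, pi_theta(.|s_t)).
    All sizes here are illustrative assumptions."""
    def __init__(self, per_step_dim, context_dim=32, hidden=64):
        super().__init__()
        self.temporal_conv = nn.Sequential(
            nn.Conv1d(per_step_dim, context_dim, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv1d(context_dim, context_dim, kernel_size=4, stride=2), nn.ReLU(),
        )
        self.dense = nn.Sequential(
            nn.Linear(per_step_dim + context_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar loss per timestep
        )

    def forward(self, buffer, current):
        # buffer:  (batch, per_step_dim, N) -- features of the last N timesteps
        #          (assumes N is large enough for both convolutions, e.g. N >= 16)
        # current: (batch, per_step_dim)    -- features of the current timestep
        f_context = self.temporal_conv(buffer).mean(dim=-1)  # pool over time -> (batch, context_dim)
        per_step_loss = self.dense(torch.cat([current, f_context], dim=-1))
        return per_step_loss.mean()  # scalar loss to backpropagate into pi_theta
```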
Experiments
• RandomHopper, RandomWalker: randomized gravity, friction, body mass, link thickness
• RandomReacher: randomized link lengths
• DirectionalHopper, DirectionalHalfCheetah: randomized forward/backward reward function
• GoalAnt: reward function based on a randomized target location
• Fetch: randomized target location
Performance
• Compare the cumulative reward obtained when a policy is trained with the already-learned loss function
  • This is referred to as test time
• Baselines
  • PPO trained with the external reward function
    • Note that EPG does not use the external reward function at test time
  • MAML
  • RL2
Performance
(Learning curves from the paper for RandomHopper, RandomWalker, and RandomReacher.)
Performance
(Learning curves from the paper for GoalAnt and Fetch: (a) task illustration, (b) right direction (as metatrained), (c) left direction (generalization).)
Performance
(Learning curves from the paper for DirectionalHopper (forward/backward) and DirectionalHalfCheetah. Captions: training over 128 policy updates × 64 steps per update = 8192 timesteps, PPO vs. no-reward EPG; training over 256 policy updates × 128 steps per update = 32768 timesteps, PPO vs. no-reward EPG; training over 512 policy updates × 128 steps per update = 65536 timesteps, PG vs. no-reward EPG.)
Performance
• Video of a policy learning with the loss function learned by EPG:
  https://storage.googleapis.com/epg-blog-data/hopper_epg_2.mp4
• Video of a policy learning with the external reward function (PPO):
  https://storage.googleapis.com/epg-blog-data/hopper_ppo_3.mp4
Performance
• What happens if tasks are sampled from a distribution different from the one used to train the loss function?
• Experiment with GoalAnt
  • During loss-function training: the target location's x coordinate is in the positive range (to the right)
  • At test time: the target location's x coordinate is in the negative range (to the left)
(Figure from the paper: (a) task illustration, (b) right direction (as metatrained), (c) left direction (generalization).)
Performance
• Video of a policy learning with the loss function learned by EPG:
  https://storage.googleapis.com/epg-blog-data/ant_epg_3.mp4
• Video of a policy learning after meta-training with RL2:
  https://storage.googleapis.com/epg-blog-data/ant_rl2_4.mp4
→ Can EPG generalize the basic part of a task (for GoalAnt, learning how to walk)?
Discussion
• Proposed a method for learning a loss function that speeds up learning on new tasks
  • The learned loss function generalizes across tasks better than existing meta-learning methods
  • Improving generalization performance and computational efficiency is left as future work
Closing remarks
• OpenAI seems to be focusing on methods such as EPG and RND that can learn without an external reward function
  • Training does not finish without massive parallelization...
• RND looks applicable to automating game debugging, so I am personally keeping an eye on it