Evolved policy gradients
1. Evolved Policy Gradients
Rein Houthooft, Richard Y. Chen, Phillip Isola, Bradly C. Stadie,
Filip Wolski, Jonathan Ho, Pieter Abbeel
AI Lab 阿部拳之
2018/12/07
9. Algorithm
• Training flow
  1. Sample tasks ℳ from the task distribution p(ℳ).
  2. For each task ℳ, act according to the policy π_θ and train the policy by descending the gradient of the loss function:
     \theta^* = \arg\min_\theta \; \mathbb{E}_{\tau \sim \mathcal{M}, \pi_\theta}\big[ L_\phi(\pi_\theta, \tau) \big]
  3. Update the loss function using the cumulative reward of the policy π_θ trained on each task:
     \phi^* = \arg\max_\phi \; \mathbb{E}_{\mathcal{M} \sim p(\mathcal{M})}\, \mathbb{E}_{\tau \sim \mathcal{M}, \pi_{\theta^*}}\big[ R_\tau \big]
(Diagram: the inner loop uses the loss function to produce trained policies; the outer loop uses the trained policies to update the loss function.)
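As a concrete illustration of the two nested optimizations above, here is a minimal NumPy sketch. Everything in it — the 1-D goal task, the scalar policy parameter θ, and the quadratic stand-in for the learned loss L_φ — is a hypothetical simplification invented for this example; in the paper, L_φ is a temporal-convolution network over the agent's experience and the outer update uses evolution strategies (slide 13).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: a task M is a 1-D goal position, the policy is a
# Gaussian with scalar mean theta, and the return R_tau is the negative
# distance to the goal averaged over a short rollout.
def sample_task():
    return rng.uniform(-1.0, 1.0)

def episode_return(theta, goal, horizon=16):
    actions = theta + 0.1 * rng.normal(size=horizon)
    return float(np.mean(-np.abs(actions - goal)))

# Stand-in for the learned loss L_phi: it simply pulls theta toward phi, so
# its gradient w.r.t. theta is available in closed form. (The real L_phi is
# a neural network that is differentiable in theta by construction.)
def loss_grad(phi, theta):
    return theta - phi

def inner_loop(phi, goal, steps=50, lr=0.1):
    """Step 2: train the policy by descending the learned loss L_phi."""
    theta = 0.0
    for _ in range(steps):
        theta -= lr * loss_grad(phi, theta)
    return theta

def outer_objective(phi, n_tasks=32):
    """Step 3's objective: E_{M ~ p(M)} E_tau[R_tau] under the policies
    trained with L_phi, estimated by sampling tasks (step 1)."""
    returns = []
    for _ in range(n_tasks):
        goal = sample_task()
        theta_star = inner_loop(phi, goal)
        returns.append(episode_return(theta_star, goal))
    return float(np.mean(returns))

# The outer loop then searches for phi that maximizes this objective;
# slide 13 does this with evolution strategies (see the sketch there).
print(outer_objective(phi=0.0), outer_objective(phi=0.5))
```

In this toy, a policy trained with L_φ ends up at θ ≈ φ, so the outer objective prefers φ near the center of the goal distribution; the only point is to show where the inner and outer loops sit relative to each other.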
Paper excerpt (as shown on the slide):

Abstract
We propose a metalearning approach for learning gradient-based reinforcement learning (RL) algorithms. The idea is to evolve a differentiable loss function, such that an agent, which optimizes its policy to minimize this loss, will achieve high rewards. The loss is parametrized via temporal convolutions over the agent's experience. Because this loss is highly flexible in its ability to take into account the agent's history, it enables fast task learning. Empirical results show that our evolved policy gradient algorithm achieves faster learning on several randomized environments compared to an off-the-shelf policy gradient method.

1. Introduction
When a human learns to solve a new control task, such as playing the violin, they immediately have a feel for what to try. At first, they may try a quick, rough stroke, and, producing a screech, will intuitively know this was the wrong thing to do. Just by listening to the sounds they produce, they will have a sense of whether or not they are making progress toward the goal. Effectively, humans have access to very … this loss function to learn quickly on a novel task. This approach can be seen as a form of metalearning, in which we learn a learning algorithm (12). Rather than … rules that generalize across data points, as in traditional machine learning, metalearning concerns itself with …

Figure 1: High-level overview of our approach. The method consists of an inner and outer optimization loop. The inner loop (boxed) optimizes the agent's policy against a loss provided by the outer loop, using gradient descent. The outer loop optimizes the parameters of the loss function such that the optimized inner-loop policy achieves high performance on an arbitrary task, such as solving a control task of interest. The evolved loss L can be understood as a parametrization of policy gradients' surrogate loss, lending the name "evolved policy gradients".
10. Algorithm
• Training flow (as on the previous slide)
  1. Sample tasks ℳ from the task distribution p(ℳ).
  2. For each task ℳ, act according to the policy π_θ and train the policy by descending the gradient of the loss function:
     \theta^* = \arg\min_\theta \; \mathbb{E}_{\tau \sim \mathcal{M}, \pi_\theta}\big[ L_\phi(\pi_\theta, \tau) \big]
  3. Update the loss function using the cumulative reward of the policy π_θ trained on each task:
     \phi^* = \arg\max_\phi \; \mathbb{E}_{\mathcal{M} \sim p(\mathcal{M})}\, \mathbb{E}_{\tau \sim \mathcal{M}, \pi_{\theta^*}}\big[ R_\tau \big]
(Same inner/outer-loop diagram as on the previous slide.)
The inner-loop objective (step 2) is differentiable, so θ can be updated by gradient descent.
The outer-loop objective (step 3) is not differentiable with respect to φ → how should it be updated?
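This is exactly the difficulty the next slide resolves: the outer objective F(φ) = \mathbb{E}_{\mathcal{M} \sim p(\mathcal{M})}\, \mathbb{E}_{\tau}[R_\tau] is only observed through complete inner-loop training runs, so its gradient with respect to φ is unavailable. Evolution strategies sidestep this by differentiating a Gaussian-smoothed version of F, which needs only function evaluations:

\nabla_\phi \, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[ F(\phi + \sigma\epsilon) \big]
  = \frac{1}{\sigma}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\big[ F(\phi + \sigma\epsilon)\, \epsilon \big]
  \approx \frac{1}{V\sigma} \sum_{v=1}^{V} F(\phi + \sigma\epsilon_v)\, \epsilon_v

The right-hand estimator is what appears in the update rule on slide 13, with F(φ + σε_v) being the return achieved by a policy trained with the perturbed loss L_{φ+σε_v}.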
13. Algorithm
• Training flow
  1. Draw V perturbation vectors ε_v from a standard normal distribution.
  2. Train a policy with each of the V perturbed loss functions L_{φ+σε_v}:
     I. Sample tasks ℳ from the task distribution p(ℳ).
     II. For each task ℳ, act according to the policy π_θ and train the policy by descending the gradient of L_{φ+σε_v}:
        \theta^* = \arg\min_\theta \; \mathbb{E}_{\tau \sim \mathcal{M}, \pi_\theta}\big[ L_{\phi+\sigma\epsilon_v}(\pi_\theta, \tau) \big]
  3. Update the loss function using the cumulative reward of the policy π_θ trained with each perturbed loss:
     \phi \leftarrow \phi + \delta_{\mathrm{out}} \, \frac{1}{V\sigma} \sum_{v=1}^{V} F(\phi + \sigma\epsilon_v)\, \epsilon_v
     where F(φ + σε_v) is the return achieved by the policy trained with L_{φ+σε_v}.
(Diagram: the loss functions are used to produce trained policies; the trained policies are used to update the loss function.)
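Below is a minimal NumPy sketch of this evolution-strategies outer loop, reusing the same hypothetical toy task and stand-in loss as the earlier sketch (1-D goal, scalar θ and φ); none of the specific functions or constants come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Same hypothetical toy pieces as in the earlier sketch.
def sample_task():
    return rng.uniform(-1.0, 1.0)

def episode_return(theta, goal, horizon=16):
    actions = theta + 0.1 * rng.normal(size=horizon)
    return float(np.mean(-np.abs(actions - goal)))

def inner_loop(phi, goal, steps=50, lr=0.1):
    theta = 0.0
    for _ in range(steps):
        theta -= lr * (theta - phi)   # gradient of the stand-in loss L_phi
    return theta

def fitness(phi_v, n_tasks=16):
    """F(phi + sigma*eps_v): mean return of policies trained with the
    perturbed loss over tasks drawn from p(M) (steps 2.I and 2.II)."""
    goals = [sample_task() for _ in range(n_tasks)]
    return float(np.mean([episode_return(inner_loop(phi_v, g), g) for g in goals]))

# Outer loop: evolution-strategies update of phi (step 3 on the slide).
phi, sigma, delta_out, V = 0.7, 0.3, 0.5, 8
for it in range(20):
    eps = rng.normal(size=V)                               # step 1: V perturbations
    F = np.array([fitness(phi + sigma * e) for e in eps])  # step 2: train and evaluate
    # step 3: phi <- phi + delta_out * (1 / (V * sigma)) * sum_v F_v * eps_v
    phi += delta_out * np.sum(F * eps) / (V * sigma)
print("final phi:", phi)   # drifts toward the centre of the goal distribution
```

In the paper, φ is the full parameter vector of the loss network and each of the V perturbed losses is used to train a separate agent (in parallel workers), but the structure of the update is the same.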
24. Performance
GoalAnt / Fetch
[Figure: Evolved Policy Gradients on GoalAnt — (a) task illustration; (b) right direction (as metatrained); (c) left direction (generalization)]
25. Performance
training over 128 (policy updates) ⇥64
(update frequency) = 8196 timesteps:
PPO vsno-reward EPG
training over 256 (policy updates)
⇥128 (update frequency) = 32768
timesteps: PPO vsno-reward EPG
training over 512 (policy updates) ⇥128
(update frequency) = 65536 timesteps:
PG vsno-reward EPG.
DirectionalHopper DirectionalHalfCheetah
backwardforward
27. Performance
• What happens when test tasks are sampled from a distribution different from the one used while training the loss function?
• Experiment on GoalAnt:
  • During loss-function training, the target location's x coordinate is in the positive range (to the right).
  • At test time, the target location's x coordinate is in the negative range (to the left).
[Figure: Evolved Policy Gradients on GoalAnt — (a) task illustration; (b) right direction (as metatrained); (c) left direction (generalization)]