Paper introduction: "Playing hard exploration games by watching YouTube"

Paper introduction:
Playing hard exploration games by watching YouTube
Yusuf Aytar, et al. arXiv:1805.11592
(+ a summary of learning from demonstrations)
Jun Ernesto Okumura
@pacocat

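TL;DR
● Using imitation learning from YouTube gameplay videos, the paper reaches human-level scores on Atari 2600 titles that were previously hard to learn
○ Three titles: "Montezuma's Revenge", "Pitfall!", and "Private Eye"
● Game-state representations are learned multimodally
○ Temporal Distance Classifier: classifies the elapsed time between game states
○ Cross-modal Temporal Distance Classifier: learns the correspondence between the screen and sound effects
○ Representation quality is evaluated with cycle-consistency
● Reinforcement learning on top of the pre-learned game representation achieves high scores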

“hard exploration” challenge
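● Earlier work on the DQN family identified tasks that are hard to learn
○ "Montezuma's Revenge", "Pitfall!", "Private Eye", ...
● Problems
○ Sparse reward: the agent rarely reaches any reward
■ In Montezuma's Revenge, the first reward is ~100 Atari frames away
■ Random exploration would have to cover 100^(actions) ~ 100^18 action sequences
○ Long time horizon: the reward signal is hard to propagate
■ Even if the agent gets lucky and reaches a reward, discounting by the factor γ dilutes the signal and learning stalls
cf. the expected cumulative discounted reward that the agent maximizes: $\mathbb{E}\left[\sum_{t} \gamma^{t} r_{t}\right]$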

”Montezuma’s Revenge”
Tobias Pfaff, Learnt agent - Montezuma's Revenge, 2018. YouTube.
https://www.youtube.com/watch?v=Msy82sIfprI
(Accessed at 2018-07-09 12:00)

“Pitfall!”
Tobias Pfaff, Learnt agent - Pitfall!, 2018. YouTube.
https://www.youtube.com/watch?v=Oi1pXDaTECM
(Accessed at 2018-07-09 12:00)

“Private Eye”
Tobias Pfaff, Learnt agent - Private Eye, 2018. YouTube.
https://www.youtube.com/watch?v=I5itifmdrEo
(Accessed at 2018-07-09 12:00)

Sparse Rewards
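Falling from a height costs a life
Touching the roaming skull costs a life
The agent has a fair number of lives, so it can fail and retry
The door will not open without the key
Trajectory to the first reward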
Tobias Pfaff, Learnt agent - Montezuma's Revenge, 2018. YouTube.
(Accessed at 2018-07-09 12:00)

Previous Scores: DQN Family
Method         Montezuma's Revenge   Pitfall!   Private Eye
DQN                            0.0     -286.1         146.7
DDQN                           0.0      -29.9         129.7
Prior. DDQN                    0.0        0.0         200.0
Duel. DDQN                     0.0        0.0         103.0
Distrib. DQN                   0.0       -2.1       15172.9
Noisy DQN                      0.0      -18.2        3966.0
Rainbow                      384.0        0.0        4234.0
Ave. Human                  4743.7     6464.0       69571.0
Hessel, Matteo, et al. "Rainbow: Combining improvements in deep reinforcement learning." arXiv preprint arXiv:1710.02298(2017).
Aytar, Yusuf, et al. "Playing hard exploration games by watching YouTube." arXiv preprint arXiv:1805.11592 (2018).
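● Exploration schemes used in the DQN family, such as ε-greedy and NoisyNet, did not produce adequate scores
○ The table lists DQN-family game scores (no-op starts, averaged over 200 testing episodes)
* Details of each algorithm are summarized in the following slides (in Japanese):
"From DQN to Rainbow: recent trends in deep reinforcement learning" https://www.slideshare.net/juneokumura/dqnrainbow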

Previous Approaches
1. Intrinsic Motivation
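○ Auxiliary rewards based on measures such as novelty or information gain encourage exploration
○ This motivates the agent, but states that are inherently hard to reach remain hard to reach
2. Learning from demonstrations (Imitation Learning)
○ Learn with reference to trajectories produced by a demonstrator
○ Prior knowledge that the agent lacks can be built in
■ e.g. avoid the skull; get the key before trying to open the door
○ "Learning from demonstrations" covers a variety of approaches (see the Appendix)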

Intrinsic Motivation
● Count-based exploration
○ Model-based Interval Estimation with Exploration Bonuses (Strehl and Littman, 2008)
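○ A pseudo-count function is proposed that extends counting to image states
■ For a pseudo-count total $\hat{n}$, the density with which state $x$ appears is $\rho(x) = \hat{N}(x) / \hat{n}$
■ The density after observing the state once more should then be $\rho'(x) = (\hat{N}(x) + 1) / (\hat{n} + 1)$
A pseudo-count can be defined from these two densities:
⇒ $\hat{N}(x) = \rho(x)\,(1 - \rho'(x)) / (\rho'(x) - \rho(x))$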
Strehl, A. L., & Littman, M. L. (2008). An analysis of model-based interval estimation for Markov decision processes.
Journal of Computer and System Sciences, 74(8), 1309-1331.
Bellman equation + exploration bonus: a reward is given for reaching unseen states
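The pseudo-count idea on this slide can be checked numerically. The sketch below is a minimal illustration, not the authors' implementation: it recovers a count from two density values (before and after one more observation of the state) and turns it into a bonus of the form β/√N. The function names and the constants `beta` and `eps` are hypothetical choices made for this example.

```python
import math


def pseudo_count(rho, rho_next):
    """Recover a pseudo-count from the density before (rho) and after
    (rho_next) observing the state one more time."""
    if not 0.0 < rho < rho_next < 1.0:
        raise ValueError("expected 0 < rho < rho_next < 1")
    return rho * (1.0 - rho_next) / (rho_next - rho)


def exploration_bonus(count, beta=0.05, eps=0.01):
    """Count-based bonus that shrinks as a state is visited more often.
    beta and eps are illustrative values, not taken from the slides."""
    return beta / math.sqrt(count + eps)


# A state seen 3 times out of 10 observations: rho = 3/10, rho' = 4/11.
count = pseudo_count(3 / 10, 4 / 11)
bonus_rare = exploration_bonus(count)
bonus_common = exploration_bonus(300.0)
```

With exact counts the formula returns the true count, which is why the same expression is a sensible definition when only a density model over images is available.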

● Count-based exploration
○ Adding the exploration bonus lets the agent visit rooms that DQN could not reach
○ Scores improve over DQN but remain far from human level
■ The comparison uses A3C+, i.e. A3C with the count bonus (scores on the stochastic ALE)
Strehl, A. L., & Littman, M. L. (2008). An analysis of model-based interval estimation for Markov decision processes.
Journal of Computer and System Sciences, 74(8), 1309-1331.
Method         Montezuma's Revenge   Pitfall!   Private Eye
DQN                            0.0     -286.1         146.7
A3C+                         142.5     -155.9         100.0
Ave. Human                  4743.7     6464.0       69571.0

● Curiosity-driven exploration
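○ On top of the sparse extrinsic reward, adds an intrinsic reward for encountering "unexpected states"
○ Intrinsic Curiosity Module (ICM): outputs the intrinsic reward from the gap between the predicted and the actual next state
○ Strong performance is reported on "VizDoom" and "Super Mario Bros"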
Pathak, D., Agrawal, P., Efros, A. A., & Darrell, T. (2017, May). Curiosity-driven exploration by self-supervised prediction.
In International Conference on Machine Learning (ICML) (Vol. 2017).
Next state predicted by the agent
Actual next state
Intrinsic reward (curiosity) derived from the gap between prediction and reality
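As a rough sketch of the curiosity signal described above: the intrinsic reward is a scaled squared distance between the predicted and the actual next-state features. The feature encoder and forward model are omitted here; the function name and the default scale `eta` are assumptions for illustration only.

```python
def intrinsic_reward(predicted_features, actual_features, eta=0.01):
    """Curiosity reward: (eta / 2) * squared L2 error of the forward model.
    Inputs are plain lists of floats standing in for learned features."""
    if len(predicted_features) != len(actual_features):
        raise ValueError("feature vectors must have the same length")
    squared_error = sum(
        (p - a) ** 2 for p, a in zip(predicted_features, actual_features)
    )
    return 0.5 * eta * squared_error


# A perfectly predicted transition yields no curiosity reward,
# while a surprising one yields a positive reward.
boring = intrinsic_reward([1.0, 2.0], [1.0, 2.0])
surprising = intrinsic_reward([0.0, 0.0], [3.0, 4.0], eta=2.0)
```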

Imitation Learning
● Deep Q-learning from Demonstrations（DQfD）
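○ Defines a loss that pulls the agent toward the expert's demonstration data
○ Pre-trains by imitation with the combined loss below, keeping the agent as close to the demo data as possible
■ The weight λ on each loss term controls how strongly the agent imitates
○ Reinforcement learning then starts from the pre-trained Q (demo data is also mixed into the replay buffer)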
Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., ... & Osband, I. (2017). Deep Q-learning from Demonstrations.
arXiv preprint arXiv:1704.03732.
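Loss terms (annotations on the loss equation):
1-step loss
n-step loss (λ: 1.0)
supervised loss (λ: 1.0): a large margin classification loss whose margin is 0 when $a = a_E$ (the expert action) and 0.8 otherwise
L2 loss (λ: 1e-5)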
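The supervised term can be written in a few lines. The sketch below assumes the usual large-margin form, max over actions of Q(s, a) plus a margin, minus Q(s, a_E), with the margin values quoted on the slide; the function name and the list-based Q-values are illustrative.

```python
def large_margin_loss(q_values, expert_action, margin=0.8):
    """Large margin classification loss for one state.

    The margin is 0 for the expert action and `margin` for every other
    action, so the loss is zero only when the expert action's Q-value
    beats all others by at least the margin.
    """
    augmented = [
        q + (0.0 if action == expert_action else margin)
        for action, q in enumerate(q_values)
    ]
    return max(augmented) - q_values[expert_action]


q = [1.0, 2.0, 0.5]
loss_when_expert_is_best = large_margin_loss(q, expert_action=1)
loss_when_expert_is_worse = large_margin_loss(q, expert_action=0)
```

In the first call the expert action leads by more than 0.8, so nothing is penalized; in the second the loss pushes Q(s, a_E) up until it clears the margin.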

Imitation Learning
● Deep Q-learning from Demonstrations（DQfD）
○ Outperforms previously proposed methods
○ Replay Buffer Spiking (RBS)
■ Puts demo data into the initial replay buffer
○ Human Experience Replay (HER)
■ Always mixes demo data into each mini-batch
○ Accelerated DQN with Expert Trajectories (ADET)
■ DQfD with the margin loss replaced by a cross-entropy loss
Hester, T., Vecerik, M., Pietquin, O., Lanctot, M., Schaul, T., Piot, B., ... & Osband, I. (2017). Deep Q-learning from Demonstrations.
arXiv preprint arXiv:1704.03732.
Method         Montezuma's Revenge   Pitfall!   Private Eye
DQN                            0.0     -286.1         146.7
A3C+                         142.5     -155.9         100.0
DQfD                        4659.0       57.3       42457.2
Ave. Human                  4743.7     6464.0       69571.0

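Motivation of the paper
● Learn from the wide variety of gameplay videos on sites such as YouTube rather than from demo data recorded in an environment set up for the experiment
● What DQfD alone cannot handle
○ It does not assume a "domain gap", so it cannot cope with sources that differ in resolution or color
○ Demo data is not always a suitable example of the kind an experimenter would prepare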

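Proposal of the paper
● Acquire a game representation through pre-training (a carefully designed embedding)
○ Relate sequences from different videos through an elapsed-time classification task
○ Relate images and sound cross-modally
● Introduce cycle-consistency to evaluate the learned representation
● As a result, performance far exceeds the earlier DQfD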

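Learning the game representation
● The game representation is learned with two classifiers
○ Temporal distance classification (TDC)
■ Learns the elapsed time between game frames
○ Cross-modal temporal distance classification (CMC)
■ Learns the correspondence between the game screen and sound effects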

● Temporal distance classification (TDC)
○ Embedding function
○ Classifier
○ Classification loss

● Cross-modal temporal distance classification (CMC)
○ Embedding function
○ Classifier
○ Classification loss
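To make the classification task concrete, here is a minimal sketch of how TDC training pairs might be generated: pick a temporal-distance bin, sample two frames that far apart, and use the bin index as the class label. The interval bins in `INTERVALS` are assumed values for illustration, and the embedding network and classifier are left out; only the labeling and the cross-entropy loss are shown. CMC would use the same scheme with a video frame and an audio snippet instead of two frames.

```python
import math
import random

# Assumed temporal-distance bins (in frames); each bin is one class.
INTERVALS = [(0, 0), (1, 1), (2, 2), (3, 4), (5, 20), (21, 200)]


def sample_tdc_pair(num_frames, rng):
    """Return (first_frame, second_frame, label) for one training example."""
    if num_frames <= INTERVALS[-1][1]:
        raise ValueError("video is shorter than the largest interval")
    label = rng.randrange(len(INTERVALS))
    low, high = INTERVALS[label]
    distance = rng.randint(low, high)
    first = rng.randrange(num_frames - distance)
    return first, first + distance, label


def classification_loss(predicted_probs, label):
    """Cross-entropy for a single example given class probabilities."""
    return -math.log(predicted_probs[label])


rng = random.Random(0)
first, second, label = sample_tdc_pair(num_frames=300, rng=rng)
```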

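Evaluating the learned representation
● Cycle-consistency (motivated by CycleGAN)
○ Consider two input sequences $V = \{v_1, \dots, v_n\}$ and $W = \{w_1, \dots, w_m\}$
○ Introduce the Euclidean distance in the embedding space: $d_\phi(v_i, w_j) = \lVert \phi(v_i) - \phi(w_j) \rVert_2$
○ An element $v_i \in V$ is cycle-consistent when $v_i = \arg\min_{v_k \in V} d_\phi(v_k, w_j)$, where $w_j = \arg\min_{w_l \in W} d_\phi(v_i, w_l)$
○ The fraction of cycle-consistent elements can be used to score the representation
(Figure: correspondence between the state space and the embedding space)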
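The cycle-consistency score described above is simple to compute once embeddings are available. The sketch below takes two lists of already-embedded frames (plain tuples of floats, standing in for the network output) and returns the fraction of frames in the first sequence that map to a nearest neighbour in the second and back to themselves. Function names are illustrative.

```python
import math


def nearest(query, candidates):
    """Index of the candidate closest to `query` in Euclidean distance."""
    return min(
        range(len(candidates)),
        key=lambda k: math.dist(query, candidates[k]),
    )


def cycle_consistency(emb_v, emb_w):
    """Fraction of elements of emb_v that are cycle-consistent w.r.t. emb_w."""
    hits = 0
    for i, v in enumerate(emb_v):
        j = nearest(v, emb_w)               # V -> W
        if nearest(emb_w[j], emb_v) == i:   # W -> V, back to the start?
            hits += 1
    return hits / len(emb_v)


aligned = cycle_consistency(
    [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)],
    [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)],
)
partly_aligned = cycle_consistency(
    [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0)],
    [(0.0, 0.0), (5.0, 0.0)],
)
```

In the second call the frame at (0.1, 0) maps to (0, 0) in the other sequence, which maps back to a different frame, so only two of three frames are cycle-consistent.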

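Learning from YouTube
● The agent learns to stay close to the YouTube gameplay videos
○ A YouTube video is split every N=16 frames to create "checkpoints"
○ The agent is rewarded when it is close to a checkpoint (by dot product in the representation space), using the reward and threshold values fixed in the paper
● Four videos are used per title
○ Three for training and one for testing (each roughly 3-10 minutes long)
○ Importantly, the demo data was not generated with the Arcade Learning Environment
● A3C and IMPALA (w/ 100 actors) were tested for training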
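The checkpoint reward can be sketched as follows, under stated assumptions: embeddings are L2-normalized, checkpoints are taken every 16 frames, and the agent must pass them roughly in order. The `threshold`, `reward`, and `lookahead` defaults are hypothetical placeholders rather than the paper's settings, and the class name is made up for this example.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products act as similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def make_checkpoints(video_embeddings, every=16):
    """Keep one normalized embedding every `every` frames of the demo video."""
    return [normalize(e) for e in video_embeddings[::every]]


class ImitationReward:
    """Pays a fixed reward when the agent reaches the next checkpoint(s)."""

    def __init__(self, checkpoints, threshold=0.5, reward=0.5, lookahead=1):
        self.checkpoints = checkpoints
        self.threshold = threshold
        self.reward = reward
        self.lookahead = lookahead
        self.next_index = 0  # first checkpoint not yet collected

    def __call__(self, agent_embedding):
        agent = normalize(agent_embedding)
        last = min(self.next_index + self.lookahead, len(self.checkpoints) - 1)
        for k in range(self.next_index, last + 1):
            similarity = sum(a * c for a, c in zip(agent, self.checkpoints[k]))
            if similarity > self.threshold:
                self.next_index = k + 1  # later checkpoints only from now on
                return self.reward
        return 0.0


video = [[1, 0]] * 16 + [[0, 1]] * 16
reward_fn = ImitationReward(make_checkpoints(video), lookahead=0)
rewards = [reward_fn(e) for e in ([2, 0], [2, 0], [0, 3], [0, 3])]
```

Each checkpoint pays out once, so an agent cannot collect reward by hovering at the same spot.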

Representation learning results
● Among the methods compared, TDC+CMC had the highest cycle-consistency
○ 2-way: cycle-consistency between the test video and one training video
○ 3-way: cycle-consistency across the test video and two training videos

Embedding Space
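● With TDC+CMC, game states are embedded as sequential (temporally ordered) information
● The same game state lands close together in the embedding space regardless of domain (video characteristics)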

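Correspondence with game context
● Visualizing the final layer shows that its units have learned to correspond to individual game objects
● CMC picks out "collecting the key", an event strongly tied to a sound effect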

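Game scores
● Results far exceed prior work (above human level on all three target titles)
● Even without environment rewards (i.e. with imitation rewards only), the scores are reasonably good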

And More…
● Learning Montezuma’s Revenge from a Single Demonstration
○ A method for learning Montezuma's Revenge recently proposed by OpenAI
■ https://blog.openai.com/learning-montezumas-revenge-from-a-single-demonstration/
○ Handles the long time horizon by running curriculum learning backward along a single demo
■ Possibly a sibling of "Reverse Curriculum Generation for Reinforcement Learning Agents"
● http://bair.berkeley.edu/blog/2017/12/20/reverse-curriculum/
○ Achieves SotA on Montezuma's Revenge with a score above 74,500
■ That said, the gap to DeepMind's 41,000 (the YouTube paper) comes largely from the number of diamonds collected, so it does not reflect a decisive difference in behavior

How the curriculum is built
● The single demo is turned into tasks in reverse order; once a task can be solved, the next one is generated
Method         Montezuma's Revenge   Pitfall!   Private Eye
DQN                            0.0     -286.1         146.7
DDQN                           0.0      -29.9         129.7
Prior. DDQN                    0.0        0.0         200.0
Duel. DDQN                     0.0        0.0         103.0
Distrib. DQN                   0.0       -2.1      15,172.9
Noisy DQN                      0.0      -18.2       3,966.0
Rainbow                      384.0        0.0       4,234.0
A3C+                         142.5     -155.9         100.0
Ave. Human                 4,743.7    6,464.0      69,571.0
DQfD                       4,659.0       57.3      42,457.2
YouTube                   41,098.4   60,258.9      98,763.2
Single Demo.              74,500.0          -             -
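The backward curriculum on this slide can be sketched as a small scheduler: episodes start from a state near the end of the demo, and the start point moves earlier once the agent succeeds often enough from the current one. The class name, the success `threshold`, and the `step` size are hypothetical; the actual training loop and environment resets are omitted.

```python
class BackwardCurriculum:
    """Tracks which demo timestep episodes should currently start from."""

    def __init__(self, demo_length, threshold=0.2, step=1):
        self.start = demo_length - 1  # begin right before the end of the demo
        self.threshold = threshold
        self.step = step

    def update(self, success_rate):
        """Move the start point earlier once the current task is solved."""
        if success_rate >= self.threshold and self.start > 0:
            self.start = max(0, self.start - self.step)
        return self.start


curriculum = BackwardCurriculum(demo_length=5, step=2)
starts = [curriculum.update(rate) for rate in (0.0, 0.5, 0.1, 0.9, 0.9)]
```

When `start` reaches 0 the agent is solving the full task from the initial state, which is how the long horizon is broken into short ones.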

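Related topics
● "Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play"
○ ICLR 2018 reading group @ PFN
https://www.slideshare.net/juneokumura/iclr2018pfn-intrinsic-motivation-and-automatic-curricula-via-asymmetric-selfplay
○ Pre-training in which two agents set tasks for each other, so that a curriculum emerges automatically
● The single-demo work scored well on Montezuma's Revenge but did worse on Pitfall!
○ Reportedly it does not work well on downsized images, and deeper networks are needed
○ Reportedly good exploration and learning-rate hyperparameters are hard to find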

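Summary
● Introduced a paper that learns game representations multimodally from YouTube videos and solves hard-exploration tasks
○ Game-state representations can be learned without the experimenter preparing data through ALE (from YouTube, a more general data source)
○ Evaluation with cycle-consistency indicates that the learned representation also embeds the game's temporal information
○ For these games, sound-effect information contributed to better performance
● (Impression) The DQN family has delivered solid results on game tasks, and research beyond learning from scratch within a single framework seems to be growing
○ Approaches are broadening to individual titles and different problems
○ Rather than exploring inefficiently from scratch, approaches that actively exploit demonstrations are becoming prominent (probably preferable for real applications)
○ Multimodal learning and embeddings of state-action spaces and models seem to be attracting attention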

Appendix
Learning from demonstrations

Learning from Demonstrations: Overview
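● Methods differ in what they learn from the demonstrations
○ Behaviour Cloning
■ Learns a policy from expert trajectories ((s, a) pairs) by supervised learning
○ Direct Policy Learning via Interactive Expert
■ The expert keeps intervening, which yields a policy that generalizes better
■ Behaviour Cloning can be seen as the 1-step version of this
○ Inverse Reinforcement Learning
■ Instead of learning a policy directly from expert trajectories, estimates the reward function and then obtains a policy by reinforcement learning on it
○ Representation Learning
■ Makes reinforcement learning more efficient by learning state and action representations from expert trajectories beforehand
○ Demonstration Data in Replay Memory
■ When a replay buffer is used, storing expert trajectories in it promotes exploration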

Behaviour Cloning
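● Supervised learning of a policy from the expert's (s, a) pairs
○ Define the distribution of states visited by the expert: $P^* = P(s \mid \pi^*)$
○ Sample (s, a) pairs from $P^*$ and train so that the agent's actions approach the expert's: $\arg\min_\theta \mathbb{E}_{(s, a^*) \sim P^*}\, L(a^*, \pi_\theta(s))$
* This brings the agent's action in state s close to the expert's action
● Issue
○ In states the expert never visited there is no training data, so the agent cannot learn what to do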
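A tabular toy version makes both the method and its weakness visible. The sketch below fits a policy by majority vote over the expert's (s, a) pairs, which is the simplest possible supervised learner and is used here only for illustration; real applications would fit a function approximator. State and action names are made up.

```python
from collections import Counter, defaultdict


def behaviour_cloning(expert_pairs):
    """Fit a tabular policy: for each state, copy the expert's most
    frequent action in that state."""
    votes = defaultdict(Counter)
    for state, action in expert_pairs:
        votes[state][action] += 1
    return {
        state: counts.most_common(1)[0][0]
        for state, counts in votes.items()
    }


demos = [("s0", "right"), ("s0", "right"), ("s0", "left"), ("s1", "jump")]
policy = behaviour_cloning(demos)

# The cloned policy has nothing to say about states the expert never visited.
action_in_unseen_state = policy.get("s2")
```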

Direct Policy Learning via Interactive Expert
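● With the expert intervening at every round, a policy that generalizes better is obtained
1. Introduce the expert policy $\pi^*$
2. for m = 1, 2, ...
a. Collect trajectories by rolling out the current policy: $\tau \sim \pi_{m-1}$
b. Define the state distribution from the trajectories: $P_m = \{s \mid s \in \tau\}$
c. Collect expert feedback: $\{\pi^*(s) \mid s \in \tau\}$
d. Train on the collected supervised data, and repeat
● There are two main families of training schemes
○ Data Aggregation (e.g. DAgger)
○ Policy Aggregation (e.g. SEARN & SMILe)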
Image from: Yisong Yue & Hoang M. Le, “Imitation Learning Tutorial”ICML2018, https://sites.google.com/view/icml2018-imitation-learning/

Direct Policy Learning via Interactive Expert
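● Data Aggregation (e.g. DAgger)
○ Trains the policy on the union of the data distributions sampled so far
● Policy Aggregation (e.g. SEARN & SMILe)
○ Trains a policy on the most recent data distribution and blends it with the previous policies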
Ross, Stéphane, Geoffrey Gordon, and Drew Bagnell. "A reduction of imitation learning and structured prediction to no-regret online learning.",
Proceedings of the fourteenth international conference on artificial intelligence and statistics. 2011.
Daumé III, Hal. "Unsupervised search-based structured prediction." Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 2009.
Ross, Stéphane, and Drew Bagnell. "Efficient reductions for imitation learning." Proceedings of the thirteenth international conference on artificial intelligence and statistics. 2010.
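The data-aggregation loop can be illustrated on a toy corridor. In this sketch (a DAgger-style loop, not a faithful reproduction of the cited algorithm), the learner rolls out its own policy, the expert labels every visited state, the labels are appended to one growing dataset, and the policy is refit. The corridor, `GOAL`, `HORIZON`, and the bad default action are all made up for the example.

```python
from collections import Counter, defaultdict

GOAL = 5       # target cell of a 1-D corridor that starts at cell 0
HORIZON = 12   # steps per rollout


def expert_action(state):
    """Expert moves toward the goal and stays once it is there."""
    if state < GOAL:
        return 1
    return -1 if state > GOAL else 0


def rollout(policy, default_action=-1):
    """Run the learner's policy; unseen states get a (bad) default action."""
    state, visited = 0, []
    for _ in range(HORIZON):
        visited.append(state)
        state = max(0, state + policy.get(state, default_action))
    return visited


def fit(dataset):
    """Supervised step: majority vote of expert labels per state."""
    votes = defaultdict(Counter)
    for state, action in dataset:
        votes[state][action] += 1
    return {s: c.most_common(1)[0][0] for s, c in votes.items()}


def dagger(rounds):
    dataset, policy = [], {}
    for _ in range(rounds):
        states = rollout(policy)                            # learner's own states
        dataset += [(s, expert_action(s)) for s in states]  # expert feedback
        policy = fit(dataset)                               # train on the union
    return policy


learned = dagger(rounds=6)
```

Each round the learner gets one cell further before hitting an unlabeled state, so the expert's labels arrive exactly where the learner's own mistakes take it.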

Inverse Reinforcement Learning
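● Algorithms that learn a policy directly from expert trajectories have drawbacks:
○ Data cost is too high (training data is needed for every state that should be handled)
○ No policy better than the expert can be learned
● Inverse Reinforcement Learning
○ Learns the expert's reward function rather than the expert policy itself
○ Then uses reinforcement learning on the learned reward function to find the optimal policy
Given expert trajectories such as $\tau^* = (s_0, a_0, s_1, a_1, \dots)$ generated by $\pi^*$,
the problem reduces to finding a reward function $R^*$ such that $\mathbb{E}\left[\sum_t \gamma^t R^*(s_t) \mid \pi^*\right] \ge \mathbb{E}\left[\sum_t \gamma^t R^*(s_t) \mid \pi\right]$ holds for every policy $\pi$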

Inverse Reinforcement Learning
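● In practice several reward functions can explain the expert trajectories, so the policy distribution has to be optimized under some constraints
● There are many solution methods and related papers; for details see, for example, the following slides by Nakata-san
○ Maximum Entropy IRL and its extensions
■ https://www.slideshare.net/YusukeNakata1/maximum-entropy-irl
○ Introduction to Generative Adversarial Imitation Learning (RL architecture study group)
■ https://www.slideshare.net/YusukeNakata1/generative-adversarial-imitation-learningrl
○ Many other papers (see the references below)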
Abbeel, Pieter, and Andrew Y. Ng. "Apprenticeship learning via inverse reinforcement learning." Proceedings of the twenty-first international conference on Machine learning. ACM, 2004.
Syed, Umar, and Robert E. Schapire. "A game-theoretic approach to apprenticeship learning." Advances in neural information processing systems. 2008.
Ziebart, Brian D., et al. "Maximum Entropy Inverse Reinforcement Learning." AAAI. Vol. 8. 2008.
Finn, Chelsea, Sergey Levine, and Pieter Abbeel. "Guided cost learning: Deep inverse optimal control via policy optimization." International Conference on Machine Learning. 2016.
Ho, Jonathan, and Stefano Ermon. "Generative adversarial imitation learning." Advances in Neural Information Processing Systems. 2016.

Representation Learning
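● Instead of a policy or a reward, learn representations of elements such as the model, states, or actions from expert data to promote exploration
○ "Playing hard exploration games by watching YouTube" (Aytar, et al. arXiv:1805.11592)
■ Learns state representations multimodally from YouTube videos and transfers that prior knowledge to reinforcement learning, solving game tasks that had been hard to learn
○ "Imitation Learning with Concurrent Actions in 3D Games"
■ With a controller that has many buttons the number of actions explodes, but pre-learning the button combinations an expert tends to use promotes exploration
■ https://www.ea.com/seed/news/seed-imitation-learning-concurrent-actions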

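Demonstration Data in Replay Memory
● In training with experience replay, mixing expert experience into the replay buffer speeds up learning
○ "Replay Buffer Spiking"
■ Initializes the replay buffer with demonstration data to help learning get off the ground
■ Shown effective on dialogue tasks
○ "Human Checkpoint Replay"
■ Always mixes demonstration data into the training batch, reflecting human experience more strongly
■ Shown effective on hard-exploration Atari tasks
○ "Accelerated DQN with Expert Trajectories" / "Deep Q-learning from Demonstrations"
■ After pre-training on expert data, expert trajectories are mixed into the replay buffer during reinforcement learning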
Lipton, Zachary, et al. "BBQ-Networks: Efficient Exploration in Deep Reinforcement Learning for Task-Oriented Dialogue Systems." arXiv preprint arXiv:1711.05715 (2017).
Hosu, Ionel-Alexandru, and Traian Rebedea. "Playing Atari games with deep reinforcement learning and human checkpoint replay." arXiv preprint arXiv:1607.05077 (2016).
Lakshminarayanan, Aravind S., Sherjil Ozair, and Yoshua Bengio. "Reinforcement learning with few expert demonstrations." NIPS Workshop on Deep Learning for Action and Interaction.
Vol. 2016. 2016.
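The approaches on this slide share one mechanism: demonstration transitions are kept in the replay memory and show up in training batches. The sketch below is one possible arrangement, assuming a fixed demo share per mini-batch; the class name, `demo_fraction`, and the tuple-shaped transitions are hypothetical.

```python
import random
from collections import deque


class DemoReplayBuffer:
    """Replay buffer that keeps demonstrations permanently and mixes a
    fixed fraction of them into every sampled mini-batch."""

    def __init__(self, demos, capacity=10000, demo_fraction=0.25, seed=0):
        self.demos = list(demos)
        if not self.demos:
            raise ValueError("at least one demonstration transition is needed")
        self.agent = deque(maxlen=capacity)  # oldest agent data is evicted
        self.demo_fraction = demo_fraction
        self.rng = random.Random(seed)

    def add(self, transition):
        self.agent.append(transition)

    def sample(self, batch_size):
        # Before the agent has any data, batches consist of demos only.
        n_demo = round(batch_size * self.demo_fraction) if self.agent else batch_size
        batch = [self.rng.choice(self.demos) for _ in range(n_demo)]
        batch += [self.rng.choice(self.agent) for _ in range(batch_size - n_demo)]
        return batch


buffer = DemoReplayBuffer([("demo", i) for i in range(3)], demo_fraction=0.25)
warmup_batch = buffer.sample(8)
for i in range(5):
    buffer.add(("agent", i))
mixed_batch = buffer.sample(8)
```

Setting `demo_fraction` to 0 after the warm-up phase would approximate pure buffer spiking, while a positive value keeps human experience in every update.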
