Teaching an Actor-Critic Agent Through Optimal Scripted Agent Trajectories

StarCraft II AI Workshop
[DeepMind & Blizzard StarCraft II AI Workshop]
Teaching an Actor-Critic Agent Through
Optimal Scripted Agent Trajectories
Chris Hoyean Song

StarCraft has 4 levels of difficulty.

Easy Level
Intermediate Level
Expert Level
and ..

The ultimate level
Korean

First, let me introduce
my hottest girlfriend.

Reinforcement Learning
TensorFlow "Newbie" Contributor
Microsoft AI MVP
TensorFlow-KR Admin (Korea's No. 1 ML community)
Ex-game developer (using Unity3D)
sjhshy@gmail.com
Posting StarCraft II Reinforcement Learning Tutorials on
http://chris-chris.ai
Kakao Corp. Data Engineer
- Data Pipeline management, Real-time log processing
- Business Intelligence, Marketing Intelligence
- Develop APIs & Dev-ops

Cool Chris presents
1. Problem Definition
2. Lessons learned from pysc2
3. Actor-Critic Imitation Learning Agent

1. Problem Definition
Level 3 :
Level 2 :
Level 4 :
Level 1 :
My topic

1. Develop Actor-Critic agent
2. Get optimal trajectories
3. Teach agent with optimal trajectories

Key idea 1 :
SAME padding + 1x1 convolution layer
Key idea 2 :
Base action masking + autoregressive sub-action selection

{"step": 1, "done": false, "obs": [[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]], "base_action": 1, "x": 12, "y": 17, "reward": 0}
https://youtu.be/6byY0sXEk6U
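A record like the one above can be turned into a supervised training pair. This is a minimal sketch, not the code from the repository: the function name `to_training_pair` and the shortened 4x4 record are made up for illustration, and only the field names come from the record shown on the slide.

```python
import json

import numpy as np

# Shortened, hypothetical record with the same fields as the one on the slide
# (a 4x4 grid instead of 32x32 so it stays readable).
record_json = (
    '{"step": 1, "done": false, '
    '"obs": [[0, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 0], [1, 0, 0, 0]], '
    '"base_action": 1, "x": 2, "y": 1, "reward": 0}'
)


def to_training_pair(line):
    """Turn one JSON trajectory line into (observation, action target)."""
    rec = json.loads(line)
    obs = np.asarray(rec["obs"], dtype=np.float32)
    # The target is what the scripted agent did in this state.
    target = (rec["base_action"], rec["x"], rec["y"])
    return obs, target


obs, target = to_training_pair(record_json)
print(obs.shape, target)
```

Each pair gives the network a state and the action the scripted agent chose there, which is the input a supervised loss would need.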

Idea 1. Supervised Learning
Source : Mastering the Game of Go with Deep Neural Networks and Tree Search, Deepmind

Idea 2. Hybrid ?!
To be continued

2. Lessons learned from pysc2

Almost 100 million possible actions at each step
Multi-agent : Agents should cooperate toward one common goal.
Complexity : The action / observation space is too large.
Each agent has to solve problems such as:
- Strategy
- Economy
- Production
- Tactics
- Recon

Lesson 1
one-hot encoding
player_relative values
1 : marine
3 : mineral
After one-hot encoding
Marine Layer : 1 = marine
Mineral Layer : 1 = mineral
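The split into a marine layer and a mineral layer can be sketched with NumPy. This is an illustrative sketch only: the function name `one_hot_layers` and the small example grid are assumptions, while the values 1 (marine) and 3 (mineral) are the ones named on the slide.

```python
import numpy as np


def one_hot_layers(player_relative, values=(1, 3)):
    """Split a player_relative grid into one binary layer per value.

    values=(1, 3) follows the slide: 1 = marine, 3 = mineral.
    Returns an array of shape [H, W, len(values)].
    """
    pr = np.asarray(player_relative)
    return np.stack([(pr == v).astype(np.float32) for v in values], axis=-1)


# Hypothetical 3x4 grid: two marines (1) and three mineral shards (3).
grid = np.array([[3, 3, 0, 0],
                 [0, 1, 1, 0],
                 [0, 0, 0, 3]])
layers = one_hot_layers(grid)
marine_layer = layers[..., 0]
mineral_layer = layers[..., 1]
```

With this encoding the network no longer sees 3 as "three times a marine"; each unit type gets its own binary channel.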

Lesson 2
Make the model simple
14 policy network model : Agent can cover all possible actions in the StarCraft II RL environment.
7 policy network model : Agent can select a unit, handle control groups, and move them.
3 policy network model (now) : Agent can recall a control group and move the unit.

Lesson 3
tf.clip_by_norm()
This operation keeps your model from blowing up.
It is a very explicit way to control model stability. Let's tune it!
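What clipping by norm does can be written out in a few lines of NumPy. This is a sketch of the same rule that `tf.clip_by_norm()` applies (rescale a tensor when its L2 norm exceeds a limit), not the training code itself; the example gradient and the limits are made-up values.

```python
import numpy as np


def clip_by_norm(grad, clip_norm):
    """Rescale grad so its L2 norm is at most clip_norm.

    Tensors that are already within the limit are returned unchanged.
    """
    norm = np.sqrt(np.sum(np.square(grad)))
    if norm <= clip_norm:
        return grad
    return grad * (clip_norm / norm)


g = np.array([3.0, 4.0])           # L2 norm is 5
clipped = clip_by_norm(g, 1.0)     # direction kept, norm scaled down to 1
untouched = clip_by_norm(g, 10.0)  # norm below the limit, nothing changes
```

The direction of the update is preserved and only its size is capped, so the clipping norm is the single knob to tune.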

Lesson 4
32x32 map size
Default map size is 64x64,
but you don't need 4 pixels
to represent one marine.

Reduce the map size
as much as you can.

First trial
(up, down, left, right)
=> 4 action space
OpenAI baselines – DQN
Started to learn
https://youtu.be/xpdQYnnxAko

Second trial
2 DQN networks
x => (32) action space
y => (32) action space
OpenAI baselines – DQN
Failed
https://youtu.be/RyFcWKWSGJE

I tried to implement
the baseline agent architecture from the paper.

Some parts of my architecture may be incorrect.
Oriol, please give me feedback if you find any mistakes.

Actor-critic model architecture 1
observation [?, 32, 32, 3]
=> 5x5 convolution, 16 filters [?, 32, 32, 16]
=> 3x3 convolution, 32 filters [?, 32, 32, 32]
=> 1x1 convolution, 1 filter [?, 32, 32, 1]
We use a 1x1 convolution layer for dimension reduction.
The 1x1 convolution layer squeezes 32 feature layers into 1 layer.
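A 1x1 convolution is a per-pixel weighted sum over channels, which is why it can squeeze 32 layers into 1 without changing the 32x32 spatial size. The NumPy sketch below illustrates that idea with random values; the helper name `conv1x1` is an assumption and the bias term is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
# [batch, height, width, channels], matching the [?, 32, 32, 32] tensor above.
features = rng.normal(size=(2, 32, 32, 32)).astype(np.float32)
# A 1x1 kernel mapping 32 input channels to 1 output channel.
kernel = rng.normal(size=(32, 1)).astype(np.float32)


def conv1x1(x, w):
    """Apply the same channel-mixing weights at every pixel."""
    return np.einsum("bhwc,co->bhwo", x, w)


out = conv1x1(features, kernel)
print(out.shape)  # (2, 32, 32, 1)
```

Because every pixel keeps its position, the single output layer can be read directly as a score map over screen coordinates.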

32 filters [?, 32, 32, 32]
=> 1x1 convolution, 1 filter [?, 32, 32, 1]
=> conv_to_fc [?, 1024, 1]
=> argmax
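Flattening the 32x32 score map to 1024 values and taking the argmax yields one index, which can be mapped back to a screen coordinate. The sketch below assumes row-major flattening (index = y * 32 + x); whether the original code uses the same ordering is an assumption, and `pick_xy` is a made-up helper name.

```python
import numpy as np


def pick_xy(spatial_logits):
    """spatial_logits: [batch, H, W, 1] -> list of (x, y), one per sample."""
    b, h, w, _ = spatial_logits.shape
    flat = spatial_logits.reshape(b, h * w)   # conv_to_fc: [batch, 1024]
    idx = flat.argmax(axis=1)                 # best flat index per sample
    ys, xs = np.unravel_index(idx, (h, w))    # row = y, column = x
    return list(zip(xs.tolist(), ys.tolist()))


logits = np.zeros((1, 32, 32, 1), dtype=np.float32)
logits[0, 17, 12, 0] = 5.0                    # highest score at row 17, column 12
print(pick_xy(logits))  # [(12, 17)]
```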

32 filters [?, 32, 32, 32]
=> conv_to_fc [?, 32768, 1]
=> fully_connected [?, 256, 1]
=> fully_connected [?, 524, 1]
=> masking unavailable actions (available actions come from env)
=> masked policy [?, 524, 1]
=> argmax
=> base action : 4 / SELECT_CONTROL_GROUP
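Masking can be sketched as a softmax in which unavailable base actions get probability 0. This is one common way to do it and not necessarily how the repository does it; `masked_policy` is a hypothetical helper and the list of available action ids is made up.

```python
import numpy as np

NUM_BASE_ACTIONS = 524  # size of the base-action policy on the slide


def masked_policy(logits, available_actions):
    """Softmax over base actions with unavailable ones forced to probability 0.

    Assumes at least one action is available.
    """
    mask = np.zeros(logits.shape, dtype=bool)
    mask[available_actions] = True
    masked_logits = np.where(mask, logits, -np.inf)
    z = masked_logits - masked_logits.max()   # subtract max for stability
    probs = np.exp(z)                         # exp(-inf) is exactly 0
    return probs / probs.sum()


rng = np.random.default_rng(0)
logits = rng.normal(size=NUM_BASE_ACTIONS)
available = [0, 1, 4, 7]                      # hypothetical ids reported by env
probs = masked_policy(logits, available)
base_action = int(probs.argmax())             # always one of `available`
```

The argmax can never land on an action the environment would reject, so no step is wasted on invalid calls.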


conv_to_fc [?, 32768, 1]
=> fully_connected [?, 256, 1]
=> fully_connected [?, 524, 1]
=> pi
sub_action 4 / control_group_act
: Recall 0, Set 1, Append 2, SetAndSteal 3, AppendAndSteal 4
=> masked pi_sub4 [?, 5, 1] => argmax
sub_action 5 / control_group_id
: 0 ~ 9
=> masked pi_sub5 [?, 10, 1] => argmax
sub_action 11 / build_queue_id
=> masked pi_sub11 [?, 10, 1]
If we do not use this result,
do not back-propagate the errors on this layer.
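One way to avoid back-propagating errors through an unused sub-action head is to multiply its per-sample loss by 0 whenever the chosen base action does not take that argument. The sketch below is an assumption about how this could be done, not the repository's code; the table `USED_HEADS` and both helper names are hypothetical, and only the entry for base action 4 comes from the slide.

```python
import numpy as np

# Which sub-action heads each base action actually uses.
# Only base action 4 (SELECT_CONTROL_GROUP) is taken from the slide.
USED_HEADS = {
    4: ("sub4", "sub5"),  # control_group_act, control_group_id
}


def head_loss_weights(base_actions, head):
    """1.0 where the chosen base action uses this head, else 0.0."""
    return np.array(
        [1.0 if head in USED_HEADS.get(a, ()) else 0.0 for a in base_actions],
        dtype=np.float32,
    )


def masked_head_loss(per_sample_loss, base_actions, head):
    """Average the head's loss over the samples that used it."""
    w = head_loss_weights(base_actions, head)
    # Zero-weighted samples contribute neither loss nor gradient.
    return float(np.sum(per_sample_loss * w) / max(w.sum(), 1.0))


losses = np.array([1.0, 100.0, 3.0], dtype=np.float32)
chosen = [4, 0, 4]   # the second sample picked a base action without sub5
loss_sub5 = masked_head_loss(losses, chosen, "sub5")
loss_sub11 = masked_head_loss(losses, chosen, "sub11")
```

The large loss of the second sample is ignored for `sub5`, and `sub11` receives no training signal at all in this batch.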

32 filters [?, 32, 32, 32]
=> conv_to_fc [?, 32768, 1]
=> fully_connected [?, 256, 1]
=> vf [?, 1, 1]

Third trial
Agent can cover all possible
actions in StarCraft II RL
environment.
14 policy network model
https://youtu.be/oMax7wq-KCs

Fourth Trial
Agent can select a unit,
handle control groups, and move
them.
7 policy network model
https://youtu.be/gNp3OnPFSRc

Fifth Trial
Agent can recall a control group
and move the unit.
3 policy network model (now)
https://youtu.be/V0HqK1-VfLw

3. Hybrid Actor-Critic Imitation Learning

Question
I know how to develop
optimal scripted agents,
but how can I make
an Actor-Critic agent
learn from them?
Question
Actor-Critic Agent
(Learning Agent)
Optimal Scripted Agent
(Optimal Agent)

It seems naive,
but it makes intuitive sense.
I propose a new way to train an Actor-Critic
agent through optimal scripted agent
trajectories.

Actor-Critic Imitation Learning Agent
Agent 1, Agent 2, Agent 3, Agent 4 (each acts with the Policy)
=> trajectories
=> Actor-Critic Policy & Value Network

Actor-Critic Imitation Learning Agent
Agent 1, Agent 2, Agent 3, Agent 4 (some act with the Policy, the others with the Optimal Script)
=> trajectories
=> Actor-Critic Policy & Value Network
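The hybrid setup can be sketched as one batch in which some workers act with a script and the rest with the learned policy, and all transitions go to the same network. Everything below is an illustrative assumption rather than the method's actual code: `scripted_action` (move to the nearest mineral shard), `collect_batch`, the stand-in policy, and the tiny grid are all made up.

```python
import numpy as np


def scripted_action(player_relative, marine_xy):
    """Hypothetical script: target the nearest mineral shard (value 3)."""
    ys, xs = np.nonzero(np.asarray(player_relative) == 3)
    if len(xs) == 0:
        return marine_xy                      # nothing to collect, stay put
    d = (xs - marine_xy[0]) ** 2 + (ys - marine_xy[1]) ** 2
    i = int(np.argmin(d))
    return (int(xs[i]), int(ys[i]))


def collect_batch(observations, marine_positions, learned_policy, num_scripts):
    """First num_scripts workers follow the script, the rest the policy."""
    batch = []
    for i, (obs, pos) in enumerate(zip(observations, marine_positions)):
        if i < num_scripts:
            action, source = scripted_action(obs, pos), "script"
        else:
            action, source = learned_policy(obs), "agent"
        batch.append((obs, action, source))
    return batch  # every transition feeds the same actor-critic update


grid = np.array([[1, 0, 0, 3],
                 [0, 0, 0, 0],
                 [3, 0, 0, 0]])
batch = collect_batch(
    observations=[grid] * 4,
    marine_positions=[(0, 0)] * 4,
    learned_policy=lambda obs: (1, 1),        # stand-in for the network
    num_scripts=2,
)
sources = [source for _, _, source in batch]
```

Here two of the four workers are scripted, which mirrors a run with two learning agents and two script agents.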

baseline : Pure A2C Agent

1. Learn faster
2. Higher max reward
3. Robustness
Learning Rate : Random (1e-5 ~ 1e-3)
X axis : episodes
Y axis : rewards

A detailed description of this idea will be
covered in the paper or a blog post.
The source code is on my GitHub:
https://github.com/chris-chris/pysc2-examples
python train_mineral_shards.py --algorithm=a2c --num_agents=2 --num_scripts=2

Wrap up
1. Problem Definition
Goal : Make the RL agent learn from expert game plays.
2. Lessons learned from pysc2
Simple model / one-hot encoding / tf.clip_by_norm() / Actor-Critic Architecture
3. Hybrid Actor-Critic Imitation Learning
The hybrid agent learns faster, reaches higher rewards, and is more robust.

- Special thanks to
Seungil You (Google), who helped me understand the papers and the algorithms
and helped me find bugs and improve my TensorFlow RL model.
- DeepMind and Blizzard Team
Thank you for the StarCraft II Learning Environment.
- Thanks for the valuable advice
Sungjoon Choi (Disney Research), Nako Sung (Naver),
Woongwon Lee (RLCode), Doyun Lee (NC Soft).

Thank you
sjhshy@gmail.com
http://chris-chris.ai
https://www.facebook.com/ai.chris.chris
https://www.linkedin.com/in/chris-song-0bb03439/
