The document discusses several topics related to artificial collective intelligence including:
1. Learning to compete through designing game environments, machine bidding in auctions, and creativity learning by generating texts, images, music and poetry.
2. Learning to collaborate through developing AI that can play StarCraft together as a team.
3. Using generative adversarial networks (GANs) to generate realistic data samples by having a generator and discriminator compete against each other. GANs have been applied to generate images and text.
2. Deep Reinforcement Learning
• Computerised agent: learning what to do
– How to map situations (states) to actions so as to maximise a numerical reward signal
Sutton, Richard S., and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
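As a concrete illustration (not from the slides), a minimal tabular Q-learning loop shows this state-to-action mapping being learned from reward alone; the toy corridor environment and hyperparameters below are invented for the sketch:

```python
import random

# Minimal tabular Q-learning on a toy 1-D corridor (hypothetical example).
# States 0..4; reaching state 4 yields reward 1 and ends the episode.
N_STATES, ACTIONS = 5, [-1, +1]          # move left / move right
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount, exploration
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

for episode in range(500):
    s = 0
    while s != N_STATES - 1:
        # Epsilon-greedy action selection: map state -> action
        a = random.choice(ACTIONS) if random.random() < epsilon \
            else max(ACTIONS, key=lambda b: Q[(s, b)])
        s_next = min(max(s + a, 0), N_STATES - 1)
        r = 1.0 if s_next == N_STATES - 1 else 0.0
        # Q-learning update towards the bootstrapped target
        target = r + gamma * max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        s = s_next

# Greedy policy per non-terminal state (should be "move right" everywhere)
print({s: max(ACTIONS, key=lambda b: Q[(s, b)]) for s in range(N_STATES - 1)})
```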
5. What is next?
• All of the above concern a single AI unit
• But true human intelligence embraces social and collective wisdom
– Collective efforts can solve problems that are otherwise unthinkable, e.g., the ESP Game and crowdsourcing
• A next grand challenge of AI
– How could large-scale multiple AI agents learn human-level collaboration (or competition) from their experiences?
7. Artificial Collective Intelligence
• A huge application space
– Trading robots gaming the stock markets
– Ad-bidding agents competing with each other over online advertising exchanges
– E-commerce collaborative-filtering recommenders predicting user interests through the wisdom of the crowd
– Traffic control
– Self-driving cars
– Creativity learning (generating texts, images, music, poetry)
– …
8. Summary
• Learning to compete
– Designing game environments
– Machine bidding in auctions
– Creativity learning (generating texts, images, music, poetry)
• Learning to collaborate
– AI plays the StarCraft game
11. Controllable Environments
• We consider environments that are controllable and strategic
• A minimax game between the agent and the environment
Zhang, Haifeng, et al. Learning to Generate (Adversarial) Environments in Deep Reinforcement Learning. Under submission, 2017.
[Figure 1: Framework dealing with non-differentiable transitions. 1. The environment generator M generates environments G_1 … G_6 (parameters θ_1 … θ_6). 2. Each environment trains an agent. 3. The agents operate in the environments with policies π_1 … π_6 respectively. 4. The agents' returns guide the generator update. For each θ, agents are trained until optimal policies are obtained; the agents are then tested in the corresponding environments, and the observed returns guide the generator to update.]
Solution for Undifferentiable Transition
Although we have proved the equivalence between the transition optimization and the policy optimization, … In this paper, we consider a particular objective of the MDP: the MDP acts as an environment minimizing the expected return of the agent, i.e. $O(M_\theta) = \mathbb{E}\big[\sum_{t=1}^{\infty} \gamma^{t} r_t\big]$. Thus, the objective function is formulated as:

$$\theta^{*} = \arg\min_{\theta} \, \max_{\pi} \; \mathbb{E}\big[\, G \mid \pi;\ M_\theta = \langle S, A, P_\theta, R, \gamma \rangle \,\big].$$

This adversarial objective can be applied to design environments that analyse the weaknesses of an agent and its policy learning algorithms.
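A minimal sketch of this minimax loop (hypothetical helper names, not the authors' released code): the generator proposes environment parameters, an agent is trained in each proposed environment, and the observed returns are the score the generator tries to minimise. Here the "training" step is a closed-form stand-in so the sketch stays self-contained.

```python
import random

# Sketch of adversarial environment generation (hypothetical, simplified).
# The "environment" is parameterised by theta = goal distance in a corridor;
# the generator searches for the theta that minimises the trained agent's return.

def train_agent_return(theta):
    """Stand-in for RL training until an optimal policy is obtained.
    An optimal agent needs `theta` steps to the goal, so the discounted
    return of its policy shrinks as the goal moves further away."""
    gamma = 0.9
    return gamma ** theta

def generator_update(candidates, scores):
    """Move the generator towards low-return (i.e. harder) environments."""
    return min(zip(scores, candidates))[1]

best_theta = 1
for iteration in range(10):
    # 1. Generate candidate environments around the current best
    candidates = [max(1, best_theta + random.choice([-1, 0, 1])) for _ in range(6)]
    # 2-3. Train an agent per environment and observe its return
    scores = [train_agent_return(th) for th in candidates]
    # 4. The returns guide the generator update (minimise agent return)
    best_theta = generator_update(candidates, scores)

print("hardest environment found: theta =", best_theta)
```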
13. Design Maze: Results
Zhang, Haifeng, et al. Learning to Generate (Adversarial) Environments in Deep Reinforcement Learning. Under submission, 2017.
[Figure: mazes generated against different agents — DFS, RHS, and DQN/Optimal.]
14. Summary
• Learning to compete
– Designing game environments
– Machine bidding in auctions
– Creativity learning (generating texts, images, music, poetry)
• Learning to collaborate
– AI plays the StarCraft game
18. Online Advertising + Artificial Intelligence
• Design learning algorithms to make the best match between advertisers and Internet users under economic constraints
• Transformed from a low-tech process to a highly optimised, mathematical, computer-centric (Wall Street-like) process
• Key directions: operations research; estimating CTR/AR; auction systems; machine learning algorithms; behavioural targeting; fighting spam (click fraud)
21. Can we have a dynamic model?
Bidding in RTB as an RL problem
[Diagram: an advertiser with an ad budget sends bid price a_t for bid request x_t; the environment returns the auction result and user response, followed by the next bid request x_{t+1}.]
• From the perspective of an advertiser with a budget, sequentially bidding in RTB is a reinforcement learning (RL) problem.
• The goal is to maximise user responses on the displayed ads.
Cai, H., K. Ren, W. Zhang, K. Malialis, and J. Wang. "Real-Time Bidding by Reinforcement Learning in Display Advertising." In The Tenth ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2017.
22. MDP Formulation of RTB
• Consider bidding in RTB as an episodic process.
[Diagram: the state [s] comprises the remaining auctions T, the remaining budget B_T, and the bid request x_T; the action [a] is the bid a; the state transition [p] is the auction result; the reward [r] is the user response. Each step moves to remaining auctions T − 1 and budget B_{T−1}; when 0 auctions and budget B_0 remain, the next episode begins.]
[s] state  [a] action  [p] state transition  [r] reward
Cai, H., K. Ren, W. Zhang, K. Malialis, and J. Wang. "Real-Time Bidding by Reinforcement Learning in Display Advertising." In The Tenth ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2017.
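A minimal sketch of this episodic MDP (hypothetical environment dynamics, not the paper's implementation): the state is (auctions left, budget left, bid request), the action is a bid price, and the reward is the user response when the auction is won.

```python
import random

# Toy episodic RTB environment (hypothetical dynamics for illustration).
class RTBEpisode:
    def __init__(self, n_auctions=1000, budget=500.0):
        self.t = n_auctions          # remaining auctions T
        self.budget = budget         # remaining budget B_T

    def bid_request(self):
        """Feature of the incoming impression, here just a predicted CTR."""
        return random.uniform(0.001, 0.05)

    def step(self, x, bid):
        """Run one auction: returns (won, reward, done)."""
        market_price = random.uniform(0.0, 2.0)        # second-price stand-in
        won = bid <= self.budget and bid > market_price
        if won:
            self.budget -= market_price                # pay the market price
        reward = 1.0 if won and random.random() < x else 0.0  # user click
        self.t -= 1
        return won, reward, self.t == 0

env = RTBEpisode()
total_clicks, done = 0.0, False
while not done:
    x = env.bid_request()
    bid = 40.0 * x                   # naive linear bidding policy on CTR x
    _, r, done = env.step(x, bid)
    total_clicks += r
print("clicks:", total_clicks, "budget left:", round(env.budget, 2))
```

An RL bidder replaces the fixed linear policy with one learned to maximise total clicks over the episode subject to the budget.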
23. Summary
• Learning to compete
– Designing game environments
– Machine bidding in auctions
– Creativity learning (generating texts, images, music, poetry)
• Learning to collaborate
– AI plays the StarCraft game
26. Generative Adversarial Nets (GANs)
• Minimax game between a discriminator & a generator:
– Discriminator (D) tries to correctly distinguish the true data from the fake model-generated data
– Generator (G) tries to generate high-quality data to fool the discriminator
• G & D can be implemented via neural networks
• Ideally, when D cannot distinguish the true and generated data, G nicely fits the true underlying data distribution
[Goodfellow, I., J. Pouget-Abadie, M. Mirza, et al. Generative Adversarial Nets. In NIPS 2014.]
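For reference, the minimax objective from Goodfellow et al. (2014) that the slide describes, with $p_{\text{data}}$ the true data distribution and $p_z$ the generator's noise prior:

$$\min_{G} \max_{D} \; \mathbb{E}_{x \sim p_{\text{data}}}\big[\log D(x)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z))\big)\big]$$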
29. GAN with Activation Maximisation
[Zhiming Zhou, Shu Rong, Han Cai, Weinan Zhang, Yong Yu, Jun Wang. Generative Adversarial Nets with Labeled Data by Activation Maximization. 2017.]
[Figure 1: The problem of overlaid gradients in LabGAN (Salimans et al., 2016) with multi-mode real data: for a generated sample lying between Class 1 and Class 2, the final gradient for G overlays Gradient 1 and Gradient 2. We assume the logit is built based on the distance between the generated sample and the class centre.]
where

$$\alpha^{\mathrm{lab}}_{k}(x) = \begin{cases} \dfrac{D_k(x)}{D_r(x)} & k \in \{1, \dots, K\} \\ 1 & k = K+1. \end{cases} \qquad (8)$$
From the formulation, we see that the overall gradient w.r.t. the generated example x is (1 − D_r(x)). This is consistent with the original GAN (Goodfellow et al., 2014) when no …
[Figure 2: The generated examples along with the true density distribution on synthetic data. Panels: LabGAN at iterations 50k/150k/200k (NLL 17.86/17.11/16.71) vs. SAM-GAN at iterations 50k/150k/200k (NLL 17.66/15.94/15.79); each panel overlays the real data p.d.f. and the generated data.]
[Figure 3: Training iterations on the synthetic data, measured with NLL by the oracle.]
[Figure 4: CIFAR-10 progress results — Inception score and AM score over 5,000 to 300,000 training iterations.]
[Figure 5: MNIST results — (a) real images, (b) generated images.]
30. SeqGAN – Sequence generation
• Generator is a reinforcement learning policy generating a sequence
– Decide the next word to generate (action) given the previous ones as the state
• Discriminator provides the reward (i.e., the probability of being true data) for the whole sequence
Lantao Yu, Weinan Zhang, Jun Wang, Yong Yu. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. AAAI 2017.
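A minimal sketch of the generator's policy-gradient step (a hypothetical toy model, not the SeqGAN release): sample a sequence token by token from the generator policy, score the whole sequence with a discriminator, and use that score as the REINFORCE reward for every action taken.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

VOCAB, HIDDEN, SEQ_LEN = 50, 32, 10      # toy sizes (assumptions)

class Generator(nn.Module):
    """Policy network: next-token distribution given the tokens so far."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, HIDDEN)
        self.rnn = nn.GRUCell(HIDDEN, HIDDEN)
        self.out = nn.Linear(HIDDEN, VOCAB)

gen = Generator()
# Stub discriminator scoring a whole sequence of token ids (a real SeqGAN
# discriminator would be a CNN/RNN over token embeddings).
disc = nn.Sequential(nn.Linear(SEQ_LEN, HIDDEN), nn.ReLU(),
                     nn.Linear(HIDDEN, 1), nn.Sigmoid())
opt = torch.optim.Adam(gen.parameters(), lr=1e-3)

# Roll out one sequence: state = previous tokens, action = next word.
h = torch.zeros(1, HIDDEN)
tok = torch.zeros(1, dtype=torch.long)   # start token (id 0)
log_probs, tokens = [], []
for _ in range(SEQ_LEN):
    h = gen.rnn(gen.embed(tok), h)
    dist = Categorical(logits=gen.out(h))
    tok = dist.sample()
    log_probs.append(dist.log_prob(tok))
    tokens.append(tok)

# The discriminator's score for the whole sequence is the reward.
seq = torch.stack(tokens, dim=1).float()         # shape (1, SEQ_LEN)
reward = disc(seq).squeeze().detach()            # prob. of being true data
loss = -reward * torch.stack(log_probs).sum()    # REINFORCE objective
opt.zero_grad(); loss.backward(); opt.step()
```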
33. Obama Speech Text Generation
Machine:
• i stood here today i have one and most important thing that not on violence throughout the horizon is OTHERS american fire and OTHERS but we need you are a strong source
• for this business leadership will remember now i can't afford to start with just the way our european support for the right thing to protect those american story from the world and
• i want to acknowledge you were going to be an outstanding job times for student medical education and warm the republicans who like my times if he said is that brought the
Human:
• When he was told of this extraordinary honor that he was the most trusted man in America
• But we also remember and celebrate the journalism that Walter practiced -- a standard of honesty and integrity and responsibility to which so many of you have committed your careers. It's a standard that's a little bit harder to find today
• I am honored to be here to pay tribute to the life and times of the man who chronicled our time.
34. Summary
• Learning to compete
– Machine bidding in auctions
– Creativity learning (generating texts, images, music, poetry)
• Learning to collaborate
– AI plays the StarCraft game
35. AI plays StarCraft
• One of the most difficult games for computers
• At least 10^1685 possible states (for reference, the game of Go has about 10^170 states)!
• How could large-scale multiple AI agents learn human-level collaborations, or competitions, from their experiences?
36. Bidirectionally-Coordinated Nets (BiCNet)
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
39. “Hit and Run” tactics
Combat: 3 Marines (ours) vs. 1 Zealot (enemy)
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
[Figure 2: Coordinated moves without collision in combat 3 Marines (ours) vs. 1 Super Zergling (enemy). Panels (a) and (b) show collisions when the agents are close by during the early stage of training; panels (c) and (d) show coordinated moves by the well-trained agents.]
[Figure 3: Hit and Run tactics in combat 3 Marines (ours) vs. 1 Zealot (enemy): (a) time step 1, run when attacked; (b) time step 2, fight back when safe; (c) time step 3, run again; (d) time step 4, fight back again. Legend: Attack, Move, Enemy.]
… efficiently propagated through the entire network. Yet, unlike CommNet [20], our communication is not fully symmetric, and we maintain certain social conventions and roles by fixing the order of the agents that join the RNN. This helps resolve any possible tie between multiple optimal joint actions [35, 36]. The structure of our bidirectionally-coordinated net (BiCNet) is illustrated in Fig. 1. It consists of …
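A minimal sketch of this coordination idea (my reading of the paper, with invented sizes, not the authors' code): a bidirectional RNN runs over the agents in a fixed order, so each agent's action depends on hidden states flowing in both directions along the agent sequence.

```python
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, HIDDEN, N_ACTIONS = 3, 16, 32, 5  # toy sizes (assumptions)

class BiCoordinatedPolicy(nn.Module):
    """Bidirectional RNN over the *agent* dimension: agents exchange hidden
    states in a fixed order, which acts as a social convention."""
    def __init__(self):
        super().__init__()
        self.encode = nn.Linear(OBS_DIM, HIDDEN)
        # Treat the agent sequence like a time sequence for the GRU.
        self.comm = nn.GRU(HIDDEN, HIDDEN, bidirectional=True, batch_first=True)
        self.act = nn.Linear(2 * HIDDEN, N_ACTIONS)   # per-agent action logits

    def forward(self, obs):                # obs: (batch, N_AGENTS, OBS_DIM)
        h = torch.relu(self.encode(obs))   # per-agent local encoding
        coordinated, _ = self.comm(h)      # info flows both ways over agents
        return self.act(coordinated)       # (batch, N_AGENTS, N_ACTIONS)

policy = BiCoordinatedPolicy()
obs = torch.randn(1, N_AGENTS, OBS_DIM)       # one joint observation
actions = policy(obs).argmax(dim=-1)          # greedy joint action
print(actions)                                # e.g. a tensor of 3 action ids
```

Fixing the agent order in the GRU is what makes the communication asymmetric, mirroring the slide's point about social conventions and roles.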
40. Coordinated moves without collision
Combat: 3 Marines (ours) vs. 1 Zergling (enemy)
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
[Figure 4: Coordinated cover attack in combat 3 Marines (ours) vs. 1 Zergling (enemy), time steps 1–4. Legend: Attack, Move, Enemy.]
[Table 1: Winning rate against difficulty settings by hit points (HP) and damage (columns: Damage=4, Damage=3). Training steps: 100k/200k/300k.]
41. Focus fire
Combat: 15 Marines (ours) vs. 16 Marines (enemy)
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
[Figure 5: "Focus fire" in combat 15 Marines (ours) vs. 16 Marines (enemy), time steps 1–4. Legend: Attack, Move.]
42. Coordinated heterogeneous agents
Combat: 2 Dropships and 2 tanks vs. 1 Ultralisk
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games. 2017.
[Figure 6: Coordinated heterogeneous agents in combat 2 Dropships and 2 tanks vs. 1 Ultralisk, time steps 1 and 2. Legend: Attack, Enemy, Load, Unload.]
… Neither scattering over all enemies nor focusing on one enemy (wasting attacking firepower, also called overkill) is desired. The grouping design in the policy network serves as the prior for BiCNet to learn "focus fire without overkill". In our experiments, we dynamically group the agents based on the agents' geometric locations. Based on the grouping inputs, BiCNet manages …