A Society of AI Agents
(群体智能的社会)
Jun Wang
UCL
CCF-GAIR	2017,	Shenzhen,	China
• 1st-ranked UK university for research strength (Research Excellence Framework, 2014)
• 29 Nobel Prize winners (14 non-UK) and 3 Fields Medal winners
• 1,107 professors (1/3 non-UK)
• 12,403 academic and professional services staff (4,875 non-UK)
• Only UK university awarded both (institutional)
• By 1878, the first English university to admit female students on equal terms with men
• Our founding principles - academic excellence and research addressing real-world problems - continue today
University College London (伦敦大学学院)
Reinforcement Learning (强化学习)
[Diagram: an AI agent (智能体) interacts with an environment (环境), taking action a_t and receiving reward r_{t+1} and the next state s_{t+1}]
Optimal action policy a* <--- maximise the cumulative reward r_1 + r_2 + … + r_t + …
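To make the loop concrete, here is a minimal sketch (my own illustration, not from the talk; the toy environment and numbers are invented) of an agent interacting with an environment and accumulating the discounted return that the optimal policy a* maximises:

```python
# Minimal illustration (not from the talk): the RL interaction loop and the
# cumulative (discounted) return r_1 + gamma*r_2 + ... that a* maximises.
import random

class ToyEnv:
    """A 5-cell corridor: action 1 moves right, action 0 moves left; reward 1 at the last cell."""
    def reset(self):
        self.s = 0
        return self.s

    def step(self, a):
        self.s = max(0, min(4, self.s + (1 if a == 1 else -1)))
        reward = 1.0 if self.s == 4 else 0.0
        return self.s, reward, self.s == 4    # next state s_{t+1}, reward r_{t+1}, done

def random_policy(state):
    return random.randrange(2)                # placeholder for a learned policy pi(a|s)

env, gamma = ToyEnv(), 0.9
state, ret, discount = env.reset(), 0.0, 1.0
for t in range(100):                          # one episode of agent-environment interaction
    action = random_policy(state)             # agent chooses action a_t
    state, reward, done = env.step(action)    # environment returns r_{t+1} and s_{t+1}
    ret += discount * reward                  # accumulate the discounted return
    discount *= gamma
    if done:
        break
print("discounted return of this episode:", round(ret, 3))
```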
Multi-agent Reinforcement Learning
(多智能体的强化学习)
[Diagram: several AI agents (智能体) share one environment (环境); each agent takes its own action a_t and receives its own reward r_{t+1} and next state s_{t+1}]
MARL Application: Machine Bidding in Online Advertising
[Diagram: the advertiser (with an ad budget) is the agent; the environment returns the auction result and the user response; the state is the bid request x_t, the action is the bid price a_t, and the next bid request is x_{t+1}]
The goal is to maximise the user responses on displayed ads.
Cai, H., K. Ren, W. Zhang, K. Malialis, and J. Wang. "Real-Time Bidding by Reinforcement Learning in Display Advertising." In The Tenth ACM International Conference on Web Search and Data Mining (WSDM). ACM, 2017.
[Excerpt from the WSDM 2017 paper (offline results on iPinYou under different budget conditions): the generalisation ability of the plain neural-network bidder RLB-NN is satisfactory only at small scales; the three refined variants RLB-NN-Seg, RLB-NN-MapD and RLB-NN-MapA do not rely on that generalisation, are more stable, and outperform the linear bidder Lin under every budget condition, while the eCPCs of all RLB variants remain close, making them practically effective.]
[Online deployment and A/B test: the proposed RLB model is deployed and tested in a live environment provided by Vlion DSP on a server cluster. Figure 10 compares Lin and RLB on bids, impressions, total clicks, CTR (%), CPM (CNY) and eCPC (CNY); Figure 11 plots total clicks over episodes. At the same cost, RLB achieves a lower eCPC and therefore more total clicks than Lin.]
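As a hedged illustration of the bidding setup on this slide (my own sketch, not the paper's RLB algorithm; the bid rule, feature names and numbers are invented), real-time bidding can be simulated as a sequential decision problem over a fixed budget:

```python
# Illustrative sketch (not the RLB algorithm from the paper): real-time bidding framed
# as a sequential decision problem - state includes the remaining budget and the bid
# request, the action is the bid price, the reward is the user response (click).
import random

def bid_policy(budget_left, requests_left, pctr, base_price=50.0):
    """Toy bid rule: scale a base price by predicted CTR and by a budget-pacing factor."""
    pacing = budget_left / max(requests_left, 1) / base_price
    return min(budget_left, base_price * pctr * 100.0 * min(pacing, 1.0))

budget, requests, clicks = 10_000.0, 1_000, 0
for t in range(requests):
    pctr = random.uniform(0.001, 0.02)                  # features of bid request x_t
    market_price = random.uniform(10.0, 80.0)           # unknown second-highest bid
    bid = bid_policy(budget, requests - t, pctr)        # action: bid price a_t
    if bid >= market_price and budget >= market_price:  # win the auction, pay the market price
        budget -= market_price
        clicks += 1 if random.random() < pctr else 0    # reward: user response (click)
print(f"budget left: {budget:.0f}, clicks won: {clicks}")
```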
MARL Application: AI plays StarCraft
Peng Peng, Quan Yuan, Ying Wen, Yaodong Yang, Zhenkun Tang, Haitao Long, Jun Wang. Multiagent Bidirectionally-Coordinated Nets for Learning to Play StarCraft Combat Games, 2017.
Multi-agent Reinforcement Learning
(多智能体的强化学习)
[Diagram: several AI agents (智能体) share one environment (环境); each agent takes its own action a_t and receives its own reward r_{t+1} and next state s_{t+1}]
Problem 1: current research is typically limited to fewer than 20 agents
Million-agent application: the Uber/DiDi ride-hailing case
Million-agent application: the bike-sharing (共享单车) case
Ecology: self-organisation
• In many real-world populations, regularity and order emerge everywhere from collective behaviour
• The theory of self-organisation: order is driven by repeated interactions and simple rules between local individuals that are initially disordered
• Limitation: it explains collective order without invoking any complexity of the individuals
David JT Sumpter. The principles of collective animal behaviour. Philosophical Transactions of the Royal Society of London B: Biological Sciences, 361(1465):5–22, 2006.
Ecology: the Lotka-Volterra (LV) model
• A major topic of population dynamics is the cycling of predator and prey populations
• The Lotka-Volterra model is used to model these cycles
• Classic example: lynx (猞猁) and hare (兔子) populations
Lotka, A. J. (1910). "Contribution to the Theory of Periodic Reaction". J. Phys. Chem. 14 (3): 271–274.
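For reference, the LV dynamics behind the lynx-hare cycles are dx/dt = αx − βxy (prey) and dy/dt = δxy − γy (predator); the short sketch below (not from the slides; parameter values are purely illustrative) integrates them numerically and reproduces the characteristic oscillations:

```python
# Minimal sketch (illustrative parameters, not from the talk): the classic
# Lotka-Volterra predator-prey equations integrated with SciPy.
import numpy as np
from scipy.integrate import solve_ivp

alpha, beta, delta, gamma = 1.0, 0.1, 0.075, 1.5  # prey growth, predation, conversion, predator death

def lotka_volterra(t, z):
    x, y = z                        # x = prey (hare), y = predator (lynx)
    dx = alpha * x - beta * x * y
    dy = delta * x * y - gamma * y
    return [dx, dy]

sol = solve_ivp(lotka_volterra, t_span=(0, 50), y0=[10.0, 5.0],
                t_eval=np.linspace(0, 50, 500))

# The two populations oscillate, with the predator peak lagging the prey peak.
print("prey peak: %.1f, predator peak: %.1f" % (sol.y[0].max(), sol.y[1].max()))
```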
Artificial Population: Large-scale predator-prey world
[Diagram: a grid world of predators and prey with obstacles; each predator has a health bar, an ID and a group membership (Group 1 / Group 2), shown at timestep t and timestep t+1]
The setting:
• Predators hunt the prey in order to avoid starvation
• Each predator has its own health bar and eyesight (a local view)
• Predators can form a group to hunt the prey
• The predator population is scaled up to 1 million
Yaodong Yang, Lantao Yu, Yiwei Bai, Jun Wang, Weinan Zhang, Ying Wen, Yong Yu. Dynamics of Artificial Populations by Million-agent Reinforcement Learning, 2017.
Reinforcement Learning with 1 million agents
[Figure 2: Million-agent Q-learning in the predator-prey world. All agents share one Q-network and one experience buffer: each agent feeds its observation and ID (via an ID embedding) into the network to obtain Q-values, takes an action, receives a reward, and contributes its transition (s_t, a_t, r_t, s_{t+1}) to the buffer, from which the shared network is updated.]
Yaodong Yang, Lantao Yu, Yiwei Bai, Jun Wang, Weinan Zhang, Ying Wen, Yong Yu. Dynamics of Artificial Populations by Million-agent Reinforcement Learning, 2017.
[Paper excerpt, reconstructed:] The expected return is $\eta(\pi) = \mathbb{E}_{\tau}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right]$, where $\tau = (s_0, a_0, r_0, s_1, \ldots)$ denotes the trajectory sampled under the agent's policy $a_t \sim \pi_{\theta}(a_t \mid s_t)$, the initial state distribution $s_0 \sim \rho_0(s_0)$, and the state transition function $T(s_{t+1} \mid s_t, a_t)$. For each agent, the goal is to learn $\theta^{*} := \arg\max_{\theta} \eta(\pi)$.
The Setting of Many-agent Q-Learning. In the multi-agent setting, agents share a common Q-value function approximated by a deep network, $Q_t(s^{i}_t, a^{i}_t) = Q_t((O^{i}_t, v^{i}), a^{i}_t)$ (see [21] for details of DQN). It is reasonable to let the whole population share the same network: biologically speaking, individuals of a species tend to inherit the same characteristics from their ancestors [32, 33, 34], so the intelligence level of each predator is assumed to be the same. Apart from each agent's observations, the Q-network also takes a real-valued identity vector $v^{i}$, which incorporates individual uniqueness into the decision-making process. To encourage exploration in the action space, actions are selected $\epsilon$-greedily from $Q^{i}_t(s^{i}_t, a^{i}_t)$. At each timestep, all agents contribute their experienced transitions $(s^{i}_t, a^{i}_t, r^{i}_t, s^{i}_{t+1})$ to a shared buffer, as shown in Fig. 2. Based on experience sampled from the buffer, the shared Q-network is updated as
$$Q(s^{i}_t, a^{i}_t) \leftarrow Q(s^{i}_t, a^{i}_t) + \alpha\Big[r^{i}_t + \gamma \max_{a' \in A} Q(s^{i}_{t+1}, a') - Q(s^{i}_t, a^{i}_t)\Big].$$
Experiments and Findings. The action space A: {move forward, backward, left, right, rotate left, rotate right, stand still, join a group, and leave a group}.
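As a rough sketch of how such a shared Q-network with an ID embedding and a common experience buffer might look (my own illustration, not the authors' code; network sizes, hyper-parameters and helper names are invented):

```python
# Illustrative sketch only (not the authors' implementation): one Q-network and one
# replay buffer shared by all agents, with an ID embedding for individual identity.
import random
from collections import deque
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, N_ACTIONS, ID_DIM = 1000, 16, 9, 8   # toy sizes; the paper scales to 1e6 agents

class SharedQNet(nn.Module):
    """One Q-network shared by the whole population, conditioned on (observation, agent ID)."""
    def __init__(self):
        super().__init__()
        self.id_emb = nn.Embedding(N_AGENTS, ID_DIM)     # real-valued identity vector v_i
        self.mlp = nn.Sequential(nn.Linear(OBS_DIM + ID_DIM, 64), nn.ReLU(),
                                 nn.Linear(64, N_ACTIONS))

    def forward(self, obs, agent_id):                    # Q((O_i, v_i), a) for all actions a
        return self.mlp(torch.cat([obs, self.id_emb(agent_id)], dim=-1))

qnet = SharedQNet()
opt = torch.optim.Adam(qnet.parameters(), lr=1e-3)
buffer = deque(maxlen=100_000)                           # shared experience buffer
gamma, eps = 0.99, 0.1

def act(obs, agent_id):
    """Epsilon-greedy action selection from the shared Q-network."""
    if random.random() < eps:
        return random.randrange(N_ACTIONS)
    with torch.no_grad():
        return qnet(obs, torch.tensor(agent_id)).argmax().item()

# Every agent pushes (agent_id, obs, action, reward, next_obs) into `buffer` each timestep.

def update(batch_size=32):
    """One TD step on the shared net: Q <- Q + alpha[r + gamma max_a' Q(s',a') - Q(s,a)]."""
    batch = random.sample(buffer, batch_size)
    ids = torch.tensor([b[0] for b in batch])
    obs = torch.stack([b[1] for b in batch])
    acts = torch.tensor([b[2] for b in batch])
    rews = torch.tensor([b[3] for b in batch], dtype=torch.float32)
    next_obs = torch.stack([b[4] for b in batch])
    q = qnet(obs, ids).gather(1, acts.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = rews + gamma * qnet(next_obs, ids).max(dim=1).values
    loss = nn.functional.mse_loss(q, target)
    opt.zero_grad(); loss.backward(); opt.step()
```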
What happens if we disable the predators' learning ability?
• Predators fail to adapt to the new environment
• The artificial ecosystem collapses quickly
Yaodong Yang, Lantao Yu, Yiwei Bai, Jun Wang, Weinan Zhang, Ying Wen, Yong Yu. Dynamics of Artificial Populations by Million-agent Reinforcement Learning, 2017.
The Dynamics of the Artificial Population
Yaodong Yang, Lantao Yu, Yiwei Bai, Jun Wang, Weinan Zhang, Ying Wen, Yong Yu. Dynamics of Artificial Populations by Million-agent Reinforcement Learning, 2017.
Tiger-sheep-rabbit: Grouping
Yaodong Yang, Lantao Yu, Yiwei Bai, Jun Wang, Weinan Zhang, Ying Wen, Yong Yu. Dynamics of Artificial Populations by Million-agent Reinforcement Learning, 2017.
Multi-agent Reinforcement Learning
(多智能体的强化学习)
[Diagram: several AI agents (智能体) share one environment (环境); each agent takes its own action a_t and receives its own reward r_{t+1} and next state s_{t+1}]
Problem 1: current research is typically limited to fewer than 20 agents
Problem 2: the environment is assumed to be given and not designable
Learning to design shopping space
A. Penn. The complexity of the elementary interface: shopping space. In Proceedings of the 5th International Space Syntax Symposium, volume 1, pages 25–42. Akkelies van Nes, 2005.
https://www.youtube.com/watch?v=NkePRXxH9D4
IKEA: designing the shopping space to encourage impulse purchases and longer stays
Learning to Design Maps for Parcel-sorting Bots
https://www.youtube.com/watch?v=_QndP_PCRSw
Learning to Design Environments (Games)
• We consider the environment to be controllable and strategic
• A minimax game between the agent and the environment
Haifeng Zhang, Jun Wang, Zhiming Zhou, Weinan Zhang, Ying Wen, Wenxin Li. Learning to Design Games: Strategic Environments in Deep Reinforcement Learning, 2017.
[Figure 1: Framework dealing with non-differentiable transitions. The environment generator produces environments G_1 … G_6 with parameter θ; for each θ, an agent is trained until an optimal policy is obtained; the agents then operate in their respective environments, and the observed agent returns guide the generator to update.]
[Paper excerpt, reconstructed:] Solution for non-differentiable transitions. Although the equivalence between transition optimization and policy optimization has been proved, the paper considers a particular MDP objective in which the MDP acts as an adversarial environment minimizing the expected (discounted) return of the agent. The objective function is thus formulated as
$$\theta^{*} = \arg\min_{\theta} \max_{\pi} \; \mathbb{E}\big[G \mid \pi;\; M_{\theta} = \langle S, A, P_{\theta}, R, \gamma \rangle\big].$$
This adversarial objective can be applied to design environments that expose the weaknesses of an agent and its policy-learning algorithms.
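To make the alternation in this objective concrete, here is a deliberately tiny sketch (entirely my own illustration, not the paper's algorithm or code): the "environment" is a two-armed bandit whose success probabilities (θ, 1−θ) are updated to minimize the return of a Q-learning agent that is retrained from scratch for every candidate θ; a crude finite-difference step stands in for the paper's generator update.

```python
# Toy sketch of adversarial environment design (not the paper's method): an outer loop
# chooses the environment parameter theta to MINIMIZE the return of a freshly trained agent.
import random

def train_agent(theta, episodes=3000, eps=0.1, lr=0.1):
    """Q-learning for a 2-armed bandit whose arms pay off with probabilities (theta, 1 - theta)."""
    q = [0.0, 0.0]
    for _ in range(episodes):
        a = random.randrange(2) if random.random() < eps else max((0, 1), key=lambda i: q[i])
        r = 1.0 if random.random() < (theta if a == 0 else 1.0 - theta) else 0.0
        q[a] += lr * (r - q[a])
    return q

def agent_return(theta, steps=5000):
    """Average per-step return of the greedy policy of a freshly trained agent in M_theta."""
    q = train_agent(theta)
    a = max((0, 1), key=lambda i: q[i])
    p = theta if a == 0 else 1.0 - theta
    return sum(random.random() < p for _ in range(steps)) / steps

theta = 0.9                                     # start from an "easy" environment
for _ in range(30):                             # outer minimization over environments
    base = agent_return(theta)
    probe = min(theta + 0.05, 1.0)
    grad = (agent_return(probe) - base) / 0.05  # noisy finite-difference gradient estimate
    theta = min(max(theta - 0.05 * grad, 0.0), 1.0)
print("adversarial theta ≈ %.2f (the hardest environment for the agent is near 0.5)" % theta)
```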
Controllable Environments: An example
• Maze:
• The AI agent tries to find an optimal strategy to escape the maze
• The environment generates a maze that makes finding the way out as difficult as possible
Haifeng Zhang, Jun Wang, Zhiming Zhou, Weinan Zhang, Ying Wen, Wenxin Li. Learning to Design Games: Strategic Environments in Deep Reinforcement Learning, 2017.
Learning to Design Environments (Games)
Learning to Design Maze: Results
[Figure: mazes designed against different agents - Optimal, DFS, RHS and DQN]
Haifeng Zhang, Jun Wang, Zhiming Zhou, Weinan Zhang, Ying Wen, Wenxin Li. Learning to Design Games: Strategic Environments in Deep Reinforcement Learning, 2017.
Machine vs Human Intelligence
“...even if these artifacts (machines) perform certain acts better than us, they would do it without the conscience of them... ...it is morally impossible that a machine will work in all the circumstances of life in the same way as our reason makes us work.”
—Descartes, René (1596–1650), French philosopher, mathematician, and man of science.
Thanks for your attention