AlphaGo in Depth

slides for 3/21 MLDM Monday
http://www.meetup.com/Taiwan-R/events/229386172/

  1. AlphaGo in Depth, by Mark Chang
  2. Overview • AI in Game Playing • Machine Learning and Deep Learning • Reinforcement Learning • AlphaGo's Methods
  3. AI in Game Playing • AI in Game Playing – Adversarial Search – MiniMax – Monte Carlo Tree Search – Multi-Armed Bandit Problem
  4. Game Playing
  5. Adversarial Search – MiniMax [figure: a game tree whose leaf nodes are scored -1, 0, or 1]
  6. Adversarial Search – MiniMax [figure: the same game tree with minimax values propagated from the leaves up to the root]
  7. Monte Carlo Tree Search [figure: a shallow search tree on top, with Monte Carlo playouts run from its leaves]
  8. Monte Carlo Tree Search • Tree Search + Monte Carlo Method – Selection – Expansion – Simulation – Back-Propagation [figure: a tree whose nodes are labelled "white wins / total playouts", e.g. root 3/5]
  9. Selection [figure: the path with the best statistics is followed from the root: 3/5 → 2/3 → 1/2 → 1/1]
  10. Expansion [figure: a new child node with statistics 0/0 is added below the selected leaf]
  11. Simulation [figure: a random playout is run from the new node to the end of the game]
  12. Back-Propagation [figure: the playout result updates every node on the selected path: 0/0 → 1/1, 1/1 → 2/2, 1/2 → 2/3, 2/3 → 3/4, root 3/5 → 4/6]
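The four phases above fit in one loop. Below is a minimal, generic MCTS skeleton (not AlphaGo's implementation): the `game` object with `is_terminal`, `legal_moves`, `play`, and `random_playout` is an assumed helper, and the sketch glosses over alternating the winning side's perspective between tree levels.

```python
import math, random

class Node:
    """One tree node: win/visit statistics (the 3/5, 2/3, ... labels above) plus children."""
    def __init__(self, state):
        self.state, self.wins, self.visits, self.children = state, 0, 0, []

def ucb1(child, parent, c=1.4):
    if child.visits == 0:
        return float("inf")
    return child.wins / child.visits + c * math.sqrt(math.log(parent.visits) / child.visits)

def mcts_iteration(root, game):
    # Selection: walk down the tree, always taking the child with the best UCB1 score.
    path = [root]
    while path[-1].children:
        path.append(max(path[-1].children, key=lambda ch: ucb1(ch, path[-1])))
    leaf = path[-1]
    # Expansion: add one child per legal move, each starting with 0/0 statistics.
    if not game.is_terminal(leaf.state):
        leaf.children = [Node(game.play(leaf.state, m)) for m in game.legal_moves(leaf.state)]
        leaf = random.choice(leaf.children)
        path.append(leaf)
    # Simulation: random playout from the new node to the end of the game (1 = win, 0 = loss).
    result = game.random_playout(leaf.state)
    # Back-propagation: update every node on the visited path.
    for node in path:
        node.visits += 1
        node.wins += result
```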
  13. Multi-Armed Bandit Problem • Exploration vs. Exploitation [figure: several slot machines with different payouts]
  14. UCB1 algorithm: choose the machine i that maximizes x̄_i + sqrt(2 log n / n_i) • x̄_i : the mean payout of machine i • n_i : the number of plays of machine i • n : the total number of plays
  15. UCB1 algorithm applied to tree search: R(s, a) = Q(s, a) + c * sqrt(log N(s) / N(s, a)), and a* = argmax_a R(s, a). In the example tree, Q(s, a) = 2/3, N(s, a) = 3, N(s) = 5, and c is a constant
  16. UCB1 algorithm, numeric example: R(s, a1) = 2/3 + 0.5 * sqrt(log 5 / 3) = 1.0329 and R(s, a2) = 1/2 + 0.5 * sqrt(log 5 / 2) = 0.9485, so R(s, a1) > R(s, a2) and a* = argmax_a R(s, a) = a1
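A quick check of the numbers on slide 16, as a minimal sketch (c = 0.5 and the win/visit counts come from the example tree; the natural logarithm is assumed):

```python
import math

def ucb1(q, n_parent, n_edge, c=0.5):
    """UCB1 score: exploitation term q plus an exploration bonus."""
    return q + c * math.sqrt(math.log(n_parent) / n_edge)

# The root has been visited 5 times; a1 won 2 of 3 playouts, a2 won 1 of 2.
r_a1 = ucb1(2/3, 5, 3)   # ~1.0329
r_a2 = ucb1(1/2, 5, 2)   # ~0.9485
print(r_a1, r_a2)        # a1 has the larger score, so it is selected
```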
  17. Machine Learning and Deep Learning • Machine Learning and Deep Learning: – Supervised Machine Learning – Neural Networks – Convolutional Neural Networks – Training Neural Networks
  18. Supervised Machine Learning [figure: during training, the model maps a problem to an output and the known solution provides feedback; after training, the model maps a new problem directly to an output]
  19. Supervised Machine Learning • Classification: assign an input to a class (Class A, Class B, Class ?) • Regression: predict a numeric score (Score = 1, Score = 3, Score = ?)
  20. Neural Networks: a neuron with inputs x1, x2, weights w1, w2 and bias w_b computes n_in = w1*x1 + w2*x2 + w_b, then applies an activation function. Sigmoid: n_out = 1 / (1 + e^(-n_in)); tanh: n_out = (1 - e^(-2*n_in)) / (1 + e^(-2*n_in)); ReLU: n_out = n_in if n_in > 0, otherwise 0
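A minimal sketch of the neuron and the three activation functions on slide 20 (plain NumPy, nothing AlphaGo-specific):

```python
import numpy as np

def neuron(x1, x2, w1, w2, wb, activation):
    n_in = w1 * x1 + w2 * x2 + wb          # weighted sum plus bias
    return activation(n_in)

sigmoid = lambda n: 1.0 / (1.0 + np.exp(-n))
tanh    = lambda n: (1.0 - np.exp(-2 * n)) / (1.0 + np.exp(-2 * n))
relu    = lambda n: n if n > 0 else 0.0
```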
  21. Neural Networks • AND Gate: truth table (x1, x2 → y): (0,0→0), (0,1→0), (1,0→0), (1,1→1). A single neuron with weights 20, 20 and bias -30 implements it: y = 1 / (1 + e^(-(20*x1 + 20*x2 - 30))); the decision boundary is the line 20*x1 + 20*x2 - 30 = 0
  22. Neural Networks [figure: a two-layer network with input layer (x, y, bias b), hidden layer (n11, n12), and output layer (n21, n22 → z1, z2), with weights W11,x ... W22,b on the connections]
  23. Neural Networks • XOR Gate: a single neuron cannot separate XOR, but two layers can. Hidden neuron n1 (weights 20, 20, bias -30) acts as AND, hidden neuron n2 (weights 20, 20, bias -10) acts as OR, and the output neuron y (weights -20 on n1, 20 on n2, bias -10) combines them. Truth table (x1, x2 → n1, n2 → y): (0,0 → 0,0 → 0), (0,1 → 0,1 → 1), (1,0 → 0,1 → 1), (1,1 → 1,1 → 0)
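A quick sketch that checks the AND and XOR constructions of slides 21 and 23 with the weights printed on the slides:

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def and_gate(x1, x2):
    return sigmoid(20 * x1 + 20 * x2 - 30)        # slide 21: weights 20, 20, bias -30

def xor_gate(x1, x2):
    n1 = sigmoid(20 * x1 + 20 * x2 - 30)          # AND-like hidden unit
    n2 = sigmoid(20 * x1 + 20 * x2 - 10)          # OR-like hidden unit
    return sigmoid(-20 * n1 + 20 * n2 - 10)       # output unit from slide 23

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(and_gate(x1, x2)), round(xor_gate(x1, x2)))
# AND column: 0 0 0 1; XOR column: 0 1 1 0
```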
  24. Multi-Class Classification • Softmax: n1_out = e^(n1_in) / (e^(n1_in) + e^(n2_in) + e^(n3_in)), and likewise n2_out and n3_out with their own numerators
  25. Multi-Class Classification • Softmax: if n1_in is much larger than n2_in and n3_in, then n1_out ≈ 1; if n1_in is much smaller than n2_in or n3_in, then n1_out ≈ 0
  26. Multi-Class Classification • One-Hot Encoding: Class 1 → (1, 0, 0), Class 2 → (0, 1, 0), Class 3 → (0, 0, 1), one output unit (n1, n2, n3) per class
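A small sketch of softmax with a one-hot target, as used on the classification slides above (plain NumPy; the example numbers are made up):

```python
import numpy as np

def softmax(n_in):
    e = np.exp(n_in - n_in.max())        # subtract the max for numerical stability
    return e / e.sum()

n_in   = np.array([2.0, 0.5, -1.0])      # raw unit inputs n1_in, n2_in, n3_in
target = np.array([1.0, 0.0, 0.0])       # one-hot encoding of "Class 1"

probs = softmax(n_in)                    # e.g. [0.79, 0.18, 0.04], sums to 1
loss  = -np.log(probs[target.argmax()])  # cross-entropy cost used on the later slides
```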
  27. Convolutional Neural Networks [figure: each unit of a convolutional layer is connected to a small width x height x depth region of the input, and the same weights are shared across all positions]
  28. Convolutional Neural Networks [figure: stacking convolutional layers; each unit sees a local receptive field of the layer below, so deeper layers cover larger receptive fields of the input layer]
  29. Convolutional Neural Networks [figure: the filters of a convolutional layer applied to an input image produce one filter-response map per filter]
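A minimal sketch of the shared-weights idea from slides 27 to 29: one 3x3 filter slid over a 2-D input (plain NumPy, no padding or stride handling):

```python
import numpy as np

def conv2d(image, kernel):
    """Slide one shared kernel over the image and collect the filter responses."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    response = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # The same weights (kernel) are reused at every position: shared weights.
            response[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return response

image  = np.random.rand(19, 19)          # a Go-board-sized single-plane input
kernel = np.random.rand(3, 3)            # one filter; a real layer has k of these
print(conv2d(image, kernel).shape)       # (17, 17)
```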
  30. Training Neural Networks • One-Hot Encoding of a position: the input consists of binary planes (player's stones, opponent's stones, empty positions), and the output is a plane with a 1 at the next position to play
  31. Training Neural Networks: forward propagation. The input s (the binary planes) goes through the input layer and the convolutional layers; the output layer produces probabilities p_w(a|s) for each position (e.g. 0.5, 0.3, 0.2 on the example board)
  32. Training Neural Networks: backward propagation. Given the golden action a_i, the cost function is -log(p_w(a_i|s)), and the weights are updated by w ← w - η ∂(-log(p_w(a_i|s))) / ∂w
  33. Cost Function [figure: -log(p_w(a_i|s)) plotted against p_w(a_i|s); the cost is large when p_w(a_i|s) ≈ 0 and close to 0 when p_w(a_i|s) ≈ 1]
  34. Gradient Descent: w ← w - η ∂(-log(p_w(a_i|s))) / ∂w, where η is the learning rate
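A toy numeric sketch of the update on slides 32 to 34: gradient descent on the cost -log p_w(a_i|s) for an invented two-move position, with a finite-difference gradient used purely for illustration:

```python
import numpy as np

def p_golden(w, x):
    """Probability the softmax policy assigns to the golden move (index 0)."""
    logits = w @ x                       # w: 2 x 3 weight matrix, x: 3 features of state s
    e = np.exp(logits - logits.max())
    return (e / e.sum())[0]

def cost(w, x):
    return -np.log(p_golden(w, x))       # the cost function of slide 33

w = np.zeros((2, 3))
x = np.array([1.0, 0.5, -0.5])
eta = 0.5                                # learning rate

for _ in range(20):
    grad = np.zeros_like(w)              # finite-difference gradient of the cost
    for idx in np.ndindex(*w.shape):
        dw = np.zeros_like(w); dw[idx] = 1e-5
        grad[idx] = (cost(w + dw, x) - cost(w, x)) / 1e-5
    w = w - eta * grad                   # gradient descent step: w <- w - eta * dJ/dw

print(p_golden(w, x))                    # p_w(a_i|s) rises toward 1, so the cost falls
```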
  35. Gradient Descent: the update w ← w - η ∂(-log(p_w(a_i|s))) / ∂w pushes the predicted probability p_w(a_i|s) toward the golden label: toward 1 for the golden action a_i, toward 0 for the others
  36. Gradient Descent [figure only]
  37. Backward Propagation: with cost function J, the chain rule gives ∂J/∂w21 = (∂J/∂n2_out) * (∂n2_out/∂n2_in) * (∂n2_in/∂w21), so the update is w21 ← w21 - η ∂J/∂w21 = w21 - η (∂J/∂n2_out)(∂n2_out/∂n2_in)(∂n2_in/∂w21)
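A short sketch of the chain rule on slide 37 for a single weight feeding a sigmoid output neuron (the numbers and the squared-error cost are invented for illustration):

```python
import numpy as np

# Forward pass through one output neuron n2 fed by n1 (slide 37 notation).
n1_out = 0.8
w21    = 0.5
n2_in  = w21 * n1_out
n2_out = 1.0 / (1.0 + np.exp(-n2_in))     # sigmoid activation

target = 1.0
J = 0.5 * (n2_out - target) ** 2          # a simple squared-error cost, for illustration

# Chain rule: dJ/dw21 = dJ/dn2_out * dn2_out/dn2_in * dn2_in/dw21
dJ_dn2out    = n2_out - target
dn2out_dn2in = n2_out * (1.0 - n2_out)    # derivative of the sigmoid
dn2in_dw21   = n1_out
dJ_dw21 = dJ_dn2out * dn2out_dn2in * dn2in_dw21

eta = 0.1
w21 = w21 - eta * dJ_dw21                 # the update rule at the bottom of slide 37
```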
  38. Reinforcement Learning • Reinforcement Learning: – Policy & Value – Policy Gradient Method
  39. Reinforcement Learning [figure: a game of Go; the only reward (feedback) is whether white wins or black wins at the end]
  40. Reinforcement Learning: at each step the agent observes state S_t, takes action A_t, and receives reward (feedback) R_t from the environment • Feedback is delayed • No supervisor, only a reward signal • Rules of the game are unknown • The agent's actions affect the subsequent state
  41. Policy • The behavior of an agent. Stochastic policy: in state s, π(a1|s) = 0.5 and π(a2|s) = 0.5. Deterministic policy: π(s) = a
  42. Value • The expected long-term reward. State-value function v_π(s): the expected reward from state s when following policy π to the end. Action-value function q_π(s, a): the expected reward from taking action a in state s and then following policy π to the end
  43. Policy Gradient Method • REINFORCE – the REward Increment = Nonnegative Factor x Offset ReinforCEment. Update rule: w ← w + α (r - b) ∂log π(a|s) / ∂w, where w are the weights of the policy function, r is the reward, b is a baseline (usually 0), and α is the learning rate
  44. Grid World Example: a 4 x 4 grid world with a terminal state of reward +1, a terminal state of reward -1, an initial position, and an agent that moves one cell per action
  45. Policy Networks: the state s is one-hot encoded (a 4 x 4 grid with a 1 at the agent's position) and fed to a neural network that outputs probabilities of actions π(a|s); an action is sampled and executed
  46. Initialization: the neural network is initialized with random weights, giving some π(a|s) for every s
  47. Forward Propagation: feed the current state s into the network to get π(a|s) and sample an action
  48. Forward Propagation: the sampled trajectory reaches the +1 terminal, so the reward is r = 1
  49. Backward Propagation: apply w ← w + α (r - b) ∂log π(a|s) / ∂w; with r = 1 the gradient r ∂log π(a|s)/∂w increases the probability of the actions that were taken
  50. Backward Propagation: the same update is applied for every state-action pair of the episode
  51. Next Iteration
  52. Forward Propagation: feed the new state s into the network to get π(a|s) and sample an action
  53. Forward Propagation: this trajectory reaches the -1 terminal, so the reward is r = -1
  54. Backward Propagation: apply w ← w + α (r - b) ∂log π(a|s) / ∂w; with r = -1 the gradient r ∂log π(a|s)/∂w decreases the probability of the actions that were taken
  55. Backward Propagation: the same update is applied for every state-action pair of the episode
  56. Next Iteration …
  57. After Several Iterations … [figure: the learned policy after several iterations]
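The loop of slides 45 to 57 can be sketched as a small REINFORCE trainer for the 4 x 4 grid world. The environment below is an assumption (the slide does not specify the terminal cells), and the policy network is reduced to a single linear softmax layer; it is a sketch, not the speaker's code.

```python
import numpy as np

class GridWorld:
    """Minimal 4 x 4 grid world: start at cell 0; +1 terminal at cell 15, -1 terminal at cell 12 (arbitrary choices)."""
    def reset(self):
        self.pos = 0
        return self.pos
    def step(self, action):                            # actions: 0=up, 1=down, 2=left, 3=right
        row, col = divmod(self.pos, 4)
        if   action == 0: row = max(row - 1, 0)
        elif action == 1: row = min(row + 1, 3)
        elif action == 2: col = max(col - 1, 0)
        else:             col = min(col + 1, 3)
        self.pos = row * 4 + col
        if self.pos == 15: return self.pos, +1.0, True
        if self.pos == 12: return self.pos, -1.0, True
        return self.pos, 0.0, False

n_states, n_actions = 16, 4
W = np.random.randn(n_actions, n_states) * 0.01        # policy "network": one linear softmax layer
alpha, baseline = 0.1, 0.0                              # learning rate and baseline b (usually 0)

def policy(state):
    """pi(a|s): one-hot encoded state -> softmax probabilities over actions."""
    x = np.zeros(n_states); x[state] = 1.0
    logits = W @ x
    e = np.exp(logits - logits.max())
    return e / e.sum(), x

env = GridWorld()
for episode in range(2000):
    trajectory, state, done = [], env.reset(), False
    while not done:                                     # forward propagation + sampling
        probs, x = policy(state)
        action = np.random.choice(n_actions, p=probs)
        trajectory.append((x, action, probs))
        state, reward, done = env.step(action)
    for x, action, probs in trajectory:                 # backward propagation: REINFORCE update
        grad_log = np.outer(-probs, x)                  # d log pi(a|s) / dW for a linear softmax
        grad_log[action] += x
        W += alpha * (reward - baseline) * grad_log
```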
  58. AlphaGo's Methods • Training: – Supervised learning: Classification – Reinforcement learning – Supervised learning: Regression • Searching: – Searching with policy and value networks – Distributed search
  59. Training pipeline: human expert data is used to train the rollout policy p_π and the SL policy network p_σ by classification; the SL policy network initializes the weights of the RL policy network p_ρ, which is improved by policy gradient; self-play data generated by p_ρ is used to train the value network v_θ by regression
  60. Supervised Learning: Classification. Trained on human expert data (KGS dataset: 160,000 games, 29.4 million positions). Rollout policy p_π: a linear-softmax network (faster but less accurate). SL policy network p_σ: a 13-layer convolutional neural network; 50 GPUs, 3 weeks; accuracy 57.0%
  61. Input/Output Data: the input is 48 feature planes (stone color: 3 planes for player, opponent, empty; liberties: 8 planes for 1-8 liberties; plus turns since, capture size, self-atari size, ladder capture, ladder escape, sensibleness); the output is the next position played
  62. Symmetries: each position is also used rotated by 90, 180, and 270 degrees, and vertically reflected in each case, for 8 symmetric variants in total
  63. SL Policy Network: input size 19 x 19, 48 planes; first layer: Conv + ReLU, kernel size 5 x 5, k filters; 2nd to 12th layers: Conv + ReLU, kernel size 3 x 3, k filters; 13th layer: kernel size 1 x 1, 1 filter, Softmax; k = 192
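A hedged sketch of the slide 63 architecture in PyTorch. The padding choices and the flattened softmax head are assumptions (the slide only gives kernel sizes, filter counts, and the softmax output); this is not DeepMind's code.

```python
import torch
import torch.nn as nn

class SLPolicyNet(nn.Module):
    def __init__(self, k=192, planes=48):
        super().__init__()
        layers = [nn.Conv2d(planes, k, kernel_size=5, padding=2), nn.ReLU()]   # 1st layer: 5x5
        for _ in range(11):                                                     # 2nd to 12th layers: 3x3
            layers += [nn.Conv2d(k, k, kernel_size=3, padding=1), nn.ReLU()]
        layers += [nn.Conv2d(k, 1, kernel_size=1)]                              # 13th layer: 1x1, 1 filter
        self.body = nn.Sequential(*layers)

    def forward(self, x):                     # x: (batch, 48, 19, 19)
        logits = self.body(x).flatten(1)      # one logit per board point
        return torch.softmax(logits, dim=1)   # probability of playing each of the 361 points

net = SLPolicyNet()
probs = net(torch.zeros(1, 48, 19, 19))       # shape (1, 361)
```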
  64. Supervised Learning: Classification. Input: a position s; golden action: the move a the human expert played. The SL policy network p_σ(a|s) outputs probabilities of actions, and backward propagation updates the weights with learning rate η: σ ← σ + η ∂log p_σ(a|s) / ∂σ
  65. Reinforcement Learning: the RL policy network p_ρ is trained on self-play data with policy gradient; its weights are initialized by the SL policy network; 10,000 x 128 games; 50 GPUs, 1 day; it won 80% of games against the SL network
  66. Reinforcement Learning: initialize the weights from the SL policy network (ρ = σ); the opponent is a policy network sampled from an opponent pool of earlier iterations; p_ρ plays against the opponent to the end of the game and receives reward r, the policy gradient method updates p_ρ, and the updated p_ρ is added to the opponent pool
  67. Policy Gradient Method: an episode s1, s2, ..., sT is played with actions sampled from p_ρ(a_t|s_t); the final reward r(s_T) is back-propagated through every step with learning rate α and baseline b: ρ ← ρ + α Σ_{i=1..T} ∂log p_ρ(a_i|s_i) / ∂ρ · (r(s_T) - b(s_t))
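The slide 67 update, sketched with the policy-network class from the previous code block. The episode tensors below stand in for a real self-play game; the reward is the final +1 or -1 and the baseline is 0 here.

```python
import torch

def policy_gradient_update(net, optimizer, states, actions, reward, baseline=0.0):
    """One REINFORCE update over a whole self-play episode (slide 67)."""
    optimizer.zero_grad()
    probs = net(states)                                  # (T, 361) action probabilities
    log_probs = torch.log(probs[torch.arange(len(actions)), actions])
    # Maximize sum_i log p_rho(a_i|s_i) * (r - b), i.e. minimize its negative.
    loss = -((reward - baseline) * log_probs).sum()
    loss.backward()
    optimizer.step()

# usage sketch (random tensors stand in for a real self-play episode):
net = SLPolicyNet()                                      # same architecture; weights would start from SL
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)
states  = torch.zeros(30, 48, 19, 19)                    # T = 30 positions
actions = torch.randint(0, 361, (30,))                   # moves sampled from p_rho
policy_gradient_update(net, optimizer, states, actions, reward=+1.0)
```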
  68. Supervised learning: Regression. The RL policy network p_ρ generates self-play data (30 million positions), and the value network v_θ is trained on it by regression; a 15-layer convolutional neural network; 50 GPUs, 1 week; MSE 0.226
  69. Value Network: input size 19 x 19, 48 planes plus 1 plane for the current color; 1st to 13th layers: the same as the policy network; 14th layer: fully connected, 256 ReLU units; 15th layer: fully connected, 1 tanh unit
  70. Input/Output Data: for each self-play game, randomly sample an integer U in 1~450; moves t = 1 ... U-1 are played by the SL policy network p_σ, move t = U is a random action, and moves t = U+1 ... end are played by the RL policy network p_ρ; the pair (s_{U+1}, z_{U+1}) of the state after move U and the final game outcome is used as one training example
  71. Supervised learning: Regression. Input: a position s; golden value: the outcome z. The value network outputs v_θ(s), and backward propagation updates the weights: θ ← θ + η (z - v_θ(s)) ∂v_θ(s) / ∂θ
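A minimal sketch of the slide 71 regression update in PyTorch. The value head follows slide 69, but the convolutional trunk is shortened to keep the example small, so this only illustrates the update, not the real network.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    def __init__(self, k=192, planes=49):                  # 48 planes + 1 for the current color
        super().__init__()
        self.trunk = nn.Sequential(                        # stand-in for the 13 conv layers
            nn.Conv2d(planes, k, 5, padding=2), nn.ReLU(),
            nn.Conv2d(k, 1, 1))
        self.head = nn.Sequential(                         # 14th and 15th layers of slide 69
            nn.Flatten(), nn.Linear(19 * 19, 256), nn.ReLU(),
            nn.Linear(256, 1), nn.Tanh())

    def forward(self, s):
        return self.head(self.trunk(s)).squeeze(1)         # one value in (-1, 1) per position

net = ValueNet()
optimizer = torch.optim.SGD(net.parameters(), lr=1e-3)

s = torch.zeros(8, 49, 19, 19)                             # batch of sampled positions s_{U+1}
z = torch.ones(8)                                          # golden values z_{U+1} (game outcomes)

loss = ((z - net(s)) ** 2).mean()                          # MSE regression, as on slide 68
optimizer.zero_grad(); loss.backward(); optimizer.step()   # theta <- theta + eta (z - v) dv/dtheta
```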
  72. Searching: each simulation has four phases: Selection (descend the tree by maximizing Q + u), Expansion (add a node with priors from p_σ), Evaluation (value network v_θ and a rollout with p_π giving reward r), and Backup (update the statistics along the path)
  73. Searching • Each edge (s, a) stores a set of statistics: • Q(s, a): combined mean action value • P(s, a): prior probability evaluated by p_σ(a|s) • W_v(s, a): estimated action value by v_θ(s) • W_r(s, a): estimated action value by rollouts with p_π(a|s) • N_v(s, a): count of evaluations by v_θ(s) • N_r(s, a): count of evaluations by p_π(a|s)
  74. Selection (PUCT algorithm): choose action a* = argmax_a (Q(s, a) + u(s, a)), where u(s, a) = c P(s, a) sqrt(Σ_b N_r(s, b)) / (1 + N_r(s, a)); Q is the exploitation term and u the exploration term; Σ_b N_r(s, b) is the visit count of the parent node s, N_r(s, a) is the visit count of edge (s, a), and c sets the level of exploration
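A small sketch of the PUCT selection rule on slide 74; the `edges` dictionary mapping actions to their statistics is an assumed data structure, and c = 5.0 is an arbitrary choice.

```python
import math

def select_action(edges, c=5.0):
    """PUCT: argmax over actions of Q(s,a) + u(s,a) (slide 74)."""
    parent_visits = sum(e["Nr"] for e in edges.values())
    def score(e):
        u = c * e["P"] * math.sqrt(parent_visits) / (1 + e["Nr"])   # exploration bonus
        return e["Q"] + u                                           # exploitation + exploration
    return max(edges, key=lambda a: score(edges[a]))

# usage: two candidate moves with prior P, rollout visit count Nr, and mean value Q
edges = {"a1": {"P": 0.6, "Nr": 10, "Q": 0.55},
         "a2": {"P": 0.2, "Nr": 1,  "Q": 0.40}}
print(select_action(edges))
```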
  75. Expansion: when the visit count of an edge exceeds a threshold, N_r(s, a) > n_thr, insert a node for the successor state s'; for every possible action a', initialize the statistics: N_v(s', a') = N_r(s', a') = 0, W_r(s', a') = W_v(s', a') = 0, P(s', a') = p_σ(a'|s')
  76. Evaluation: the new position s' is evaluated by the value network, giving v_θ(s'), and actions are simulated with the rollout policy network p_π until a terminal position s_T is reached, where the reward r(s_T) is calculated
  77. Backup: update the statistics of every visited edge (s, a): N_r(s, a) ← N_r(s, a) + 1, W_r(s, a) ← W_r(s, a) + r(s_T), N_v(s, a) ← N_v(s, a) + 1, W_v(s, a) ← W_v(s, a) + v_θ(s'), and Q(s, a) = (1 - λ) W_v(s, a) / N_v(s, a) + λ W_r(s, a) / N_r(s, a), where λ is an interpolation constant
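A sketch of the slide 77 backup step over the same per-edge statistics dictionaries used in the selection sketch; `lam` stands for the interpolation constant λ.

```python
def backup(path, rollout_reward, leaf_value, lam=0.5):
    """Update N_r, W_r, N_v, W_v and the combined Q along the visited path (slide 77)."""
    for edge in path:                      # each edge is the statistics dict of a visited (s, a)
        edge["Nr"] += 1
        edge["Wr"] += rollout_reward       # r(s_T) from the rollout with p_pi
        edge["Nv"] += 1
        edge["Wv"] += leaf_value           # v_theta(s') from the value network
        edge["Q"] = ((1 - lam) * edge["Wv"] / edge["Nv"]
                     + lam * edge["Wr"] / edge["Nr"])
```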
  78. Distributed Search: the main search tree runs on the master CPU; the policy and value networks p_σ(a'|s') and v_θ(s') are evaluated on 176 GPUs; the rollout policy networks p_π run on 1,202 CPUs and return r(s_T)
  79. Reference • Mastering the game of Go with deep neural networks and tree search – http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html
  80. Further Reading • Monte Carlo Tree Search – https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/ • Neural Networks Backward Propagation – http://cpmarkchang.logdown.com/posts/277349-neural-network-backward-propagation • Convolutional Neural Networks – http://cs231n.github.io/convolutional-networks/ • Policy Gradient Method: REINFORCE – https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a-html/node37.html
  81. About the Speaker • Email: ckmarkoh at gmail dot com • Blog: http://cpmarkchang.logdown.com • Github: https://github.com/ckmarkoh • Facebook: https://www.facebook.com/ckmarkoh.chang • Slideshare: http://www.slideshare.net/ckmarkohchang • Linkedin: https://www.linkedin.com/pub/mark-chang/85/25b/847