AlphaGo	in	Depth		
by	Mark	Chang	
1
Overview	
•  AI	in	Game	Playing	
•  Machine	Learning	and	Deep	Learning	
•  Reinforcement	Learning		
•  AlphaGo's	Methods		
2
AI	in	Game	Playing	
•  AI	in	Game	Playing	
– Adversarial	Search	–	MiniMax		
– Monte	Carlo	Tree	Search	
– Multi-Armed Bandit Problem
3
Game	Playing	
4
Adversarial	Search	–	MiniMax		
[Figure: a game tree whose leaf positions are scored -1, 0, or 1]
5
Adversarial	Search	–	MiniMax		
[Figure: the minimax values (-1, 0, 1) are backed up from the leaves to the root, alternating between the maximizing and minimizing player]
6
Monte	Carlo	Tree	Search	
[Figure: the upper levels of the game tree are explored by tree search; the positions below are evaluated by Monte Carlo random playouts]
7
Monte	Carlo	Tree	Search	
•  Tree	Search	+	Monte	Carlo	Method		
– Selection
– Expansion
– Simulation
– Back-Propagation
[Figure: a search tree where each node stores white wins / total simulations, e.g. 3/5 at the root]
8
Selection
[Figure: starting from the root (3/5), a path is selected down the tree according to the node statistics]
9
Expansion	
[Figure: a new child node with statistics 0/0 is added under the selected leaf]
10
Simulation
[Figure: from the new node (0/0), a random playout is run to the end of the game]
11
Back-Propagation
[Figure: the playout result is propagated back up the selected path, updating each visited node's statistics (the root becomes 4/6)]
12
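The four steps above can be sketched as a small Python loop. This is a minimal illustration, not AlphaGo's implementation: the Game interface (legal_moves, play, result) is hypothetical, and selection here simply follows the highest win ratio rather than the UCB1 rule introduced on the next slides.

```python
# Minimal MCTS sketch: Selection, Expansion, Simulation, Back-Propagation.
import random

class Node:
    def __init__(self, game, parent=None):
        self.game, self.parent = game, parent
        self.children = {}              # move -> child Node
        self.wins, self.visits = 0, 0   # "white wins / total" statistics

def select(node):
    # Selection: descend while every legal move already has a child node.
    while node.children and len(node.children) == len(node.game.legal_moves()):
        node = max(node.children.values(),
                   key=lambda c: c.wins / c.visits if c.visits else float("inf"))
    return node

def expand(node):
    # Expansion: add one untried move as a new child with statistics 0/0.
    untried = [m for m in node.game.legal_moves() if m not in node.children]
    if not untried:
        return node
    move = random.choice(untried)
    child = Node(node.game.play(move), parent=node)
    node.children[move] = child
    return child

def simulate(game):
    # Simulation: play random moves to the end of the game, return the result.
    while game.legal_moves():
        game = game.play(random.choice(game.legal_moves()))
    return game.result()                # e.g. 1 if white wins, 0 otherwise

def backpropagate(node, result):
    # Back-Propagation: update wins/total along the path back to the root.
    while node is not None:
        node.visits += 1
        node.wins += result
        node = node.parent

def mcts(root, iterations=1000):
    for _ in range(iterations):
        leaf = expand(select(root))
        backpropagate(leaf, simulate(leaf.game))
```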
Multi-Armed Bandit Problem
•  Exploration vs. Exploitation
[Figure: a row of slot machines with different observed payouts (e.g. 6, 7, 8); the player must trade off pulling the best-known machine against trying the others]
13
UCB1	algorithm		
Choose the machine i that maximizes:
argmax_i ( x̄_i + sqrt( 2·log n / n_i ) )
•  x̄_i : the mean payout of machine i
•  n_i : the number of plays of machine i
•  n : the total number of plays
[Figure: the same slot machines, with each machine now scored by the UCB1 formula]
14
UCB1	algorithm		
R(s, a) = Q(s, a) + c · sqrt( log N(s) / N(s, a) )
a* = argmax_a R(s, a)
[Figure: the tree from the MCTS example (root 3/5); for the edge with statistics 2/3:
Q(s, a) = 2/3, N(s, a) = 3, N(s) = 5, c = constant]
15
UCB1	algorithm		
[Figure: the root s has statistics 3/5, with child edges a1 (2/3) and a2 (1/2)]
R(s, a1) = 2/3 + 0.5 · sqrt( log 5 / 3 ) = 1.0329
R(s, a2) = 1/2 + 0.5 · sqrt( log 5 / 2 ) = 0.9485
R(s, a1) > R(s, a2)
a* = argmax_a R(s, a) = a1
16
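The two scores above can be checked directly; a small sketch, assuming the statistics from the figure and c = 0.5 with the natural logarithm:

```python
# UCB1 score R(s, a) = Q(s, a) + c * sqrt(log N(s) / N(s, a)) for the two edges above.
import math

def ucb1(q, n_parent, n_edge, c=0.5):
    return q + c * math.sqrt(math.log(n_parent) / n_edge)

r_a1 = ucb1(q=2/3, n_parent=5, n_edge=3)   # ~1.0329
r_a2 = ucb1(q=1/2, n_parent=5, n_edge=2)   # ~0.9485
best = "a1" if r_a1 > r_a2 else "a2"       # a1 is selected
```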
Machine	Learning	and	Deep	Learning	
•  Machine	Learning	and	Deep	Learning	:	
– Supervised	Machine	Learning		
– Neural	Networks		
– Convolutional Neural Networks
– Training	Neural	Networks	
17
Supervised	Machine	Learning		
Machine		
Learning	
Model	
Problem: Solution: Output:
Problem:	 Output:	
Machine		
Learning	
Model	
Feedback	
18
Supervised	Machine	Learning		
Classification Regression
Class	A	 Class	B	
Score	=	1	
Score	=	?	
Score	=	3		
19	
Class	?
Neural	Networks	
[Figure: a neuron n with inputs x1, x2, weights w1, w2, and a bias weight w_b]
n_in = w1·x1 + w2·x2 + w_b
Sigmoid: n_out = 1 / (1 + e^(-n_in))
tanh: n_out = (1 - e^(-2·n_in)) / (1 + e^(-2·n_in))
ReLU: n_out = n_in if n_in > 0, 0 otherwise
20
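For reference, the three activation functions above in NumPy (a sketch; n_in is the weighted sum w1·x1 + w2·x2 + w_b):

```python
# Activation functions applied to the neuron input n_in.
import numpy as np

def sigmoid(n_in):
    return 1.0 / (1.0 + np.exp(-n_in))

def tanh(n_in):
    return (1.0 - np.exp(-2 * n_in)) / (1.0 + np.exp(-2 * n_in))  # equals np.tanh(n_in)

def relu(n_in):
    return np.where(n_in > 0, n_in, 0.0)
```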
Neural	Networks	
•  AND	Gate	
x1 | x2 | y
 0 |  0 | 0
 0 |  1 | 0
 1 |  0 | 0
 1 |  1 | 1
[Figure: a single neuron with weights 20, 20 and bias -30 separates (1,1) from (0,0), (0,1), (1,0)]
y = 1 / (1 + e^-(20·x1 + 20·x2 - 30))
Decision boundary: 20·x1 + 20·x2 - 30 = 0
21
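Plugging the weights above into a sigmoid neuron reproduces the AND truth table (a sketch; any sufficiently large weights with this sign pattern behave the same way):

```python
# AND gate as a single sigmoid neuron: y = sigmoid(20*x1 + 20*x2 - 30).
import math

def and_neuron(x1, x2, w1=20, w2=20, wb=-30):
    return 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + wb)))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(and_neuron(x1, x2)))   # 0, 0, 0, 1
```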
Neural	Networks	
[Figure: a fully-connected network: input layer (x, y, bias b), hidden layer (n11, n12, bias b), output layer (n21, n22) producing z1, z2; each connection carries a weight such as w11,x or w21,11]
22
Neural	Networks	
•  XOR	Gate	
[Figure: XOR built from two hidden sigmoid neurons and one output neuron:
n1 = sigmoid(20·x1 + 20·x2 - 30) (an AND gate)
n2 = sigmoid(20·x1 + 20·x2 - 10) (an OR gate)
y = sigmoid(-20·n1 + 20·n2 - 10)]
x1 | x2 | n1 | n2 | y
 0 |  0 |  0 |  0 | 0
 0 |  1 |  0 |  1 | 1
 1 |  0 |  0 |  1 | 1
 1 |  1 |  1 |  1 | 0
23
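The same check for the two-layer XOR network, using the weights from the figure:

```python
# XOR from two hidden sigmoid neurons (AND, OR) and one output neuron.
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor(x1, x2):
    n1 = sig(20 * x1 + 20 * x2 - 30)    # AND
    n2 = sig(20 * x1 + 20 * x2 - 10)    # OR
    return sig(-20 * n1 + 20 * n2 - 10)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xor(x1, x2)))   # 0, 1, 1, 0
```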
MulI-Class	ClassificaIon	
•  SoftMax
n1_out = e^(n1_in) / ( e^(n1_in) + e^(n2_in) + e^(n3_in) )
n2_out = e^(n2_in) / ( e^(n1_in) + e^(n2_in) + e^(n3_in) )
n3_out = e^(n3_in) / ( e^(n1_in) + e^(n2_in) + e^(n3_in) )
24
Multi-Class Classification
•  SoftMax
n1_out = e^(n1_in) / ( e^(n1_in) + e^(n2_in) + e^(n3_in) )
n1_out ≈ 1 when n1_in ≫ n2_in and n1_in ≫ n3_in
n1_out ≈ 0 when n1_in ≪ n2_in or n1_in ≪ n3_in
25
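A NumPy sketch of this softmax output layer (subtracting the maximum is a standard numerical-stability trick, not part of the slide):

```python
# Softmax over the pre-activations n_in, producing the outputs n_out.
import numpy as np

def softmax(n_in):
    e = np.exp(n_in - np.max(n_in))   # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([5.0, 1.0, 1.0])))   # first output close to 1
print(softmax(np.array([1.0, 5.0, 5.0])))   # first output close to 0
```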
Multi-Class Classification
•  One-Hot Encoding:
      Class 1 | Class 2 | Class 3
n1:      1    |    0    |    0
n2:      0    |    1    |    0
n3:      0    |    0    |    1
26
Convolutional Neural Networks
[Figure: a convolutional layer maps an input volume (width × height × depth) to an output volume; the same weights are shared across all spatial positions]
27
Convolutional Neural Networks
[Figure: each neuron in a convolutional layer is connected only to a local receptive field of the layer below; the next convolutional layer again connects to local receptive fields of the previous one]
28
Convolutional Neural Networks
[Figure: the filters in the convolutional layers are applied to the input image and produce filter responses]
29
Training	Neural	Networks	
•  One-Hot	Encoding:	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 1	 0	 0	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 1	 0	 0	
0	 0	 0	 1	 0	
0	 0	 1	 0	 0	
0	 0	 0	 0	 0	
1	 1	 1	 1	 1	
1	 1	 0	 1	 1	
1	 1	 0	 0	 1	
1	 1	 0	 1	 1	
1	 1	 1	 1	 1	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 1	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
Player’s		
stones	
Opponent’s		
stones	
Empty
positions
Next
position
Input	 Output	
30
Training	Neural	Networks	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 1	 0	 0	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 1	 0	 0	
0	 0	 0	 1	 0	
0	 0	 1	 0	 0	
0	 0	 0	 0	 0	
1	 1	 1	 1	 1	
1	 1	 0	 1	 1	
1	 1	 0	 0	 1	
1	 1	 0	 1	 1	
1	 1	 1	 1	 1	
0	 0	 0	 0	 0	
0	 .5	 0	 0	 0	
0	 .3	 0	 0	 0	
0	 .2	 0	 0	 0	
0	 0	 0	 0	 0	
Forward propagation
p_w(a|s)
Inputs:
Input
layer
Convolutional
layer
Output
layer
Outputs:
s
31
Training	Neural	Networks	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 1	 0	 0	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 1	 0	 0	
0	 0	 0	 1	 0	
0	 0	 1	 0	 0	
0	 0	 0	 0	 0	
1	 1	 1	 1	 1	
1	 1	 0	 1	 1	
1	 1	 0	 0	 1	
1	 1	 0	 1	 1	
1	 1	 1	 1	 1	
0	 0	 0	 0	 0	
0	 .5	 0	 0	 0	
0	 .3	 0	 0	 0	
0	 .2	 0	 0	 0	
0	 0	 0	 0	 0	
Inputs:		
Input		
layer	
Convolutional
layer
Output
layer
Outputs:
s
p_w(a|s)
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 1	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
Golden action a_i:
Backward propagation
Cost function: -log( p_w(a_i|s) )
w ← w - η · ∂( -log p_w(a_i|s) ) / ∂w
32
Cost Function
[Figure: the cost -log( p_w(a_i|s) ) as a function of w: it is large when p_w(a_i|s) ≈ 0 and approaches 0 when p_w(a_i|s) ≈ 1]
33
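A tiny check of how this cost behaves at the two extremes:

```python
# Cross-entropy cost -log p: large when p is near 0, near 0 when p is near 1.
import math

for p in (0.01, 0.5, 0.99):
    print(p, -math.log(p))   # 4.61, 0.69, 0.01
```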
Gradient	Descent	
w ← w - η · ∂( -log p_w(a_i|s) ) / ∂w
η : Learning Rate
[Figure: each step moves w against the gradient of the cost -log p_w(a_i|s)]
34
Gradient	Descent	
[Figure: the output p_w(a_i|s) as a function of w for the state s, shown for the case where the golden a_i is 1 and the case where the golden a_i is 0]
35
Gradient	Descent	
36
Backward Propagation
[Figure: neuron n1 feeds neuron n2 through the weight w21; n2's output enters the cost function J]
∂J/∂w21 = ∂J/∂n2(out) · ∂n2(out)/∂n2(in) · ∂n2(in)/∂w21
w21 ← w21 - η · ∂J/∂w21
w21 ← w21 - η · ∂J/∂n2(out) · ∂n2(out)/∂n2(in) · ∂n2(in)/∂w21
37
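A numeric sketch of this chain rule for a single sigmoid neuron n2 fed by n1 through the weight w21; the squared-error cost J used here is just an example choice:

```python
# Backward propagation through one weight:
# dJ/dw21 = dJ/dn2_out * dn2_out/dn2_in * dn2_in/dw21
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

n1_out, w21, target, eta = 0.8, 0.5, 1.0, 0.1
n2_in = w21 * n1_out
n2_out = sig(n2_in)
J = 0.5 * (n2_out - target) ** 2            # example cost function

dJ_dn2out    = n2_out - target
dn2out_dn2in = n2_out * (1 - n2_out)        # derivative of the sigmoid
dn2in_dw21   = n1_out
dJ_dw21 = dJ_dn2out * dn2out_dn2in * dn2in_dw21

w21 = w21 - eta * dJ_dw21                   # gradient descent step
```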
Reinforcement	Learning		
•  Reinforcement	Learning	:		
– Policy	&	Value		
– Policy	Gradient	Method	
38
Reinforcement	Learning		
Reward	(Feedback)	
Reward	(Feedback)	
white	
win	
black	
win	
39
Reinforcement	Learning		
State: S_t
Reward
(Feedback): R_t
Action: A_t
•  Feedback is delayed.
•  No supervisor, only a reward signal.
•  Rules of the game are unknown.
•  Agent's actions affect the subsequent state.
Agent	
Environment	
40
Policy
•  The behavior of an agent

Stochastic Policy:
state s, actions a1 and a2
π(a1 | s) = 0.5
π(a2 | s) = 0.5

Deterministic Policy:
state s, action a
π(s) = a
41
Value
•  The expected long-term reward

State-value Function v_π(s):
state s, policy π, rewards r until the end
Action-value Function q_π(s, a):
state s, action a, policy π, rewards r until the end
42
Policy	Gradient	Method	
•  REINFORCE
– the REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility

w ← w + α(r - b) · ∂log π(a|s) / ∂w
α : learning rate
w : weights in the policy function
r : reward
b : baseline (usually = 0)
43
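A sketch of this update for a small linear-softmax policy; the state features, action set, and learning rate are placeholders for illustration, with the baseline b = 0 as on the slide:

```python
# REINFORCE: w <- w + alpha * (r - b) * d(log pi(a|s)) / dw
import numpy as np

def policy(w, s):
    logits = w @ s                       # one row of weights per action
    e = np.exp(logits - logits.max())
    return e / e.sum()                   # pi(.|s)

def reinforce_update(w, s, a, r, alpha=0.1, b=0.0):
    pi = policy(w, s)
    grad_log = -np.outer(pi, s)          # d(log pi(a|s))/dw, softmax part...
    grad_log[a] += s                     # ...plus the term for the action taken
    return w + alpha * (r - b) * grad_log

w = np.zeros((4, 5))                     # 4 actions, 5 state features
s = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
w = reinforce_update(w, s, a=2, r=1.0)   # the sampled action becomes more likely
```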
Grid	World	Example	
4	x	4	Grid	World	
Terminal	
Reward = 1
Terminal
Reward = -1
Initial
position
Action
Agent	
44
Policy	Networks	
0	 0	 0	 0	
0	 1	 0	 0	
0	 0	 0	 0	
0	 0	 0	 0	
One-hot
encoding
Probabilities of actions
Sampling
Execute
action
Neural	
Networks	
s → π(a|s)
45
Initialization
Neural
Networks
Initialized with
random weights
π(a|s) for every s
46
Forward	PropagaIon	
Neural	
Networks	
π(a|s)
s
47
Forward	PropagaIon	
Neural	
Networks	
π(a|s)
s
Reward	r = 1
48
Backward	PropagaIon	
Neural	
Networks	
Reward	r = 1
49
w ← w + α(r - b) · ∂log π(a|s) / ∂w
r · ∂log π(a|s) / ∂w
Backward	PropagaIon	
Neural	
Networks	
Reward	r = 1
50
w ← w + α(r - b) · ∂log π(a|s) / ∂w
r · ∂log π(a|s) / ∂w
Next Iteration
Neural	
Networks	
51
Forward	PropagaIon	
Neural	
Networks	
π(a|s)
s
52
Forward	PropagaIon	
Neural	
Networks	
π(a|s)
s
Reward r = -1
53
Backward	PropagaIon	
Neural	
Networks	
Reward r = -1
54
w ← w + α(r - b) · ∂log π(a|s) / ∂w
r · ∂log π(a|s) / ∂w
Backward	PropagaIon	
Neural	
Networks	
Reward r = -1
55
w ← w + α(r - b) · ∂log π(a|s) / ∂w
r · ∂log π(a|s) / ∂w
Next Iteration …
Neural	
Networks	
56
After Several Iterations …
57
AlphaGo’s	Methods	
•  Training:	
– Supervised learning: Classification
– Reinforcement learning
– Supervised learning: Regression
•  Searching:		
– Searching	with	policy	and	value	networks	
– Distributed	search		
58
Training	
Human	expert	data	 Self-play	data	
Rollout	
policy	
SL	policy	
network	
RL	policy	
network	
Value	
network	
Classification Regression
Policy
gradient
Generate
data
Initialize
weights
p_π, p_σ, p_ρ, v_θ
59
Supervised Learning: Classification
Human expert data
Rollout
policy
SL policy
network
Classification
p_π, p_σ
KGS dataset
160,000 games
29.4 million positions
linear-softmax network
(faster but less accurate)
13-layer convolutional
neural network
50 GPUs, 3 weeks
Accuracy: 57.0%
60
Input/Output	Data	
0	 0	 0	
0	 1	 0	
0	 0	 0	
0	 1	 0	
0	 0	 1	
0	 1	 0	
1	 0	 1	
1	 0	 0	
1	 0	 1	
Input	
Next
position
Output	
0	 0	 0	
1	 0	 0	
0	 0	 0	
Stone	color:	
3	planes	
player,	opponent,	empty	
Liberty:	
8	planes	
1~8 liberties
Stone	color,	Liberty,	Turns	
since,	Capture	size,	
Self-atari	size,	Ladder	capture,	
Ladder	escape,	Sensibleness.	
Total:	48	planes	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 1	 0	
0	 0	 0	
0	 1	 0	
0	 0	 1	
0	 1	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
61
Symmetries	
62	
[Figure: the eight symmetries of a position: the input, its rotations by 90, 180, and 270 degrees, and the vertical reflection of each]
SL	Policy	Network	
Input	
Size:	19x19	
48	planes	
First	layer	
Conv+ReLU	
Kernel	size:	5x5	
k	filters		
2nd	to	12th	layers	
Conv+ReLU	
Kernel	size:	3x3	
k	filters		
13th layer
Kernel size: 1x1
1 filter
Softmax
k = 192
63
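A rough PyTorch sketch of this architecture; the padding choices and the final softmax over the 19×19 board follow the description above and are assumptions, not released code:

```python
# SL policy network sketch: 48 input planes -> 12 conv+ReLU layers -> 1x1 conv -> softmax.
import torch
import torch.nn as nn

k = 192
layers = [nn.Conv2d(48, k, kernel_size=5, padding=2), nn.ReLU()]     # 1st layer, 5x5 kernels
for _ in range(11):                                                  # 2nd to 12th layers, 3x3 kernels
    layers += [nn.Conv2d(k, k, kernel_size=3, padding=1), nn.ReLU()]
layers += [nn.Conv2d(k, 1, kernel_size=1)]                           # 13th layer, 1x1 kernel, 1 filter

policy_net = nn.Sequential(*layers)

board = torch.zeros(1, 48, 19, 19)          # one position, 48 feature planes
logits = policy_net(board).view(1, -1)      # flatten the 19x19 output map
p = torch.softmax(logits, dim=1)            # probability for each of the 361 points
```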
Supervised Learning: Classification
Input: s
SL Policy network: p_σ
Probabilities of actions: p_σ(a|s)
Golden action: a
Backward propagation (learning rate η):
σ ← σ - η · ∂( -log p_σ(a|s) ) / ∂σ
64
Reinforcement	Learning		
Self-play	data	
RL	policy	
network	
Policy	
gradient	
p⇢
50 GPUs, 1 day
won 80% of games against the SL policy network
10,000 x 128 games
Weights initialized
by the SL policy network
65
Reinforcement Learning
[Figure: the RL policy network p_ρ is initialized with the SL policy network p_σ's weights (ρ = σ). It plays games against an opponent drawn from a pool of earlier versions of p_ρ; when a game ends, the reward r is used by the policy gradient method to update p_ρ, and the updated p_ρ is added to the opponent pool]
66
Policy	Gradient	Method	
[Figure: a self-play game: p_ρ selects actions with probabilities p_ρ(a1|s1), p_ρ(a2|s2), ..., p_ρ(aT|sT) over states s1, s2, ..., sT, and the reward r(sT) is observed at the terminal state]
Backward propagation (learning rate α, baseline b):
ρ ← ρ + α · Σ_{i=1..T} ∂log p_ρ(a_i|s_i)/∂ρ · ( r(s_T) - b(s_t) )
67
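A sketch of this update, assuming the per-step gradients of log p_ρ(a_i|s_i) with respect to ρ have already been computed by the network library, and treating the baseline as a constant:

```python
# rho <- rho + alpha * sum_i d(log p_rho(a_i|s_i))/d(rho) * (r(s_T) - b)
def policy_gradient_update(rho, grad_logps, final_reward, alpha=0.01, b=0.0):
    # grad_logps: one gradient array per time step of the self-play game
    total = sum(grad_logps)
    return rho + alpha * (final_reward - b) * total
```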
Supervised learning: Regression
Self-play	data	
RL	policy	
network	
Value	
network	
Regression	
Generate	
data	
p⇢
v✓
30 million positions
50 GPUs, 1 week
MSE: 0.226
initialize
weights
15-layer convolutional
neural network
68
Value	Network	
Input	
Size:	19x19	
48	planes	
14th layer
Fully-connected
256 ReLU units
1st ~ 13th layers:
the same as the
policy network
15th	layer	
Fully-connected	
1	tanh	unit	
+1	unit		
(current	color)	
69
Input/Output Data
1. Randomly sample an integer U in 1~450.
2. [Figure: moves t = 1 ... U-1 are played by the SL policy network p_σ; a random action is played at t = U; moves t = U+1 ... end are played by the RL policy network p_ρ]
3. Generate a training example (s_{U+1}, z_{U+1}):
state: s_{U+1}, value: z_{U+1} = reward r at the end of the game
70
Supervised learning: Regression
Input: s
Value network: v_θ
Output value: v_θ(s)
Golden value: z
Backward propagation:
θ ← θ + η · (z - v_θ(s)) · ∂v_θ(s)/∂θ
71
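This is the gradient step for the squared error (z - v_θ(s))²/2; a sketch, assuming v_θ(s) and its gradient with respect to θ are provided by the network:

```python
# theta <- theta + eta * (z - v_theta(s)) * d(v_theta(s))/d(theta)
def value_regression_update(theta, v_of_s, grad_v, z, eta=0.01):
    return theta + eta * (z - v_of_s) * grad_v
```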
Searching	
Selection    Expansion    Evaluation    Backup
[Figure: selection follows max(Q + u) down the tree; expansion adds a node with priors from p_σ; evaluation combines the value network v_θ with a p_π rollout reward r; backup updates the statistics along the visited path]
72
Searching
•  Each edge (s, a) stores a set of statistics:
•  Q(s, a) : combined mean action value
•  P(s, a) : prior probability evaluated by p_σ(a|s)
•  W_v(s, a) : estimated action value from v_θ(s)
•  W_r(s, a) : estimated action value from rollouts with p_π(a|s)
•  N_v(s, a) : count of evaluations by v_θ(s)
•  N_r(s, a) : count of evaluations by p_π(a|s)
73
Selection
PUCT Algorithm: choose action a* = argmax_a ( Q(s, a) + u(s, a) )
u(s, a) = c · P(s, a) · sqrt( Σ_b N_r(s, b) ) / ( 1 + N_r(s, a) )
Q(s, a) : exploitation term
u(s, a) : exploration term
Σ_b N_r(s, b) : visit count of the parent node s
N_r(s, a) : visit count of the edge (s, a)
c : level of exploration
74
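A sketch of this selection rule over one node's edge statistics; the dict-of-edges layout is an illustrative assumption, not AlphaGo's data structure:

```python
# a* = argmax_a ( Q(s,a) + c * P(s,a) * sqrt(sum_b N_r(s,b)) / (1 + N_r(s,a)) )
import math

def select_action(edges, c=5.0):
    # edges: dict mapping action a -> {'Q': ..., 'P': ..., 'Nr': ...} for one state s
    total_visits = sum(e['Nr'] for e in edges.values())
    def score(e):
        u = c * e['P'] * math.sqrt(total_visits) / (1 + e['Nr'])
        return e['Q'] + u
    return max(edges, key=lambda a: score(edges[a]))
```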
Expansion
1. If the visit count exceeds a threshold, N_r(s, a) > n_thr,
insert the node for the successor state s'.
2. For every possible action a', initialize the statistics:
N_v(s', a') = N_r(s', a') = 0
W_r(s', a') = W_v(s', a') = 0
P(s', a') = p_σ(a'|s')
75
Evaluation
1. Evaluate v_θ(s') with the value network v_θ.
2. Simulate the game from s' with the rollout policy network p_π.
When reaching the terminal state s_T, calculate the reward r(s_T).
76
Backup
Update the statistics of every visited edge (s, a):
N_r(s, a) ← N_r(s, a) + 1
W_r(s, a) ← W_r(s, a) + r(s_T)
N_v(s, a) ← N_v(s, a) + 1
W_v(s, a) ← W_v(s, a) + v_θ(s')
Q(s, a) = (1 - λ) · W_v(s, a) / N_v(s, a) + λ · W_r(s, a) / N_r(s, a)
λ : interpolation constant
77
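A sketch of this backup for one visited edge, using the same dict layout as the selection sketch above:

```python
# Update counts, totals, and the combined value Q for one visited edge (s, a).
def backup_edge(edge, rollout_reward, value_estimate, lam=0.5):
    edge['Nr'] += 1
    edge['Wr'] += rollout_reward        # r(s_T) from the rollout
    edge['Nv'] += 1
    edge['Wv'] += value_estimate        # v_theta(s') from the value network
    edge['Q'] = ((1 - lam) * edge['Wv'] / edge['Nv']
                 + lam * edge['Wr'] / edge['Nr'])
    return edge
```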
Distributed	Search		
[Figure: p_σ(a'|s') and v_θ(s') are evaluated by the policy and value networks; p_π rollouts produce r(s_T)]
Main	search	tree	
Master	CPU	
Policy	&	value	networks	
176	GPUs	
Rollout	policy	networks	
1,202	CPUs		
78
Reference	
•  Mastering	the	game	of	Go	with	deep	neural	
networks	and	tree	search	
– http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html
79
Further	Reading	
•  Monte Carlo Tree Search
– https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/
•  Neural Networks Backward Propagation
– http://cpmarkchang.logdown.com/posts/277349-neural-network-backward-propagation
•  Convolutional Neural Networks
– http://cs231n.github.io/convolutional-networks/
•  Policy Gradient Method: REINFORCE
– https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a-html/node37.html
80
About	the	Speaker	
•  Email:	ckmarkoh	at	gmail	dot	com	
•  Blog: http://cpmarkchang.logdown.com
•  Github: https://github.com/ckmarkoh
F.C.C	
Mark	Chang	
•  Facebook: https://www.facebook.com/ckmarkoh.chang
•  Slideshare: http://www.slideshare.net/ckmarkohchang
•  Linkedin: https://www.linkedin.com/pub/mark-chang/85/25b/847
81

  26. & N_{r}(s,a) \leftarrow N_{r}(s,a) +1 \\ & W_{r}(s,a) \leftarrow W_{r}(s,a) + r(s_{T}) \\ & N_{v}(s,a) \leftarrow N_{v}(s,a) +1 \\ & W_{v}(s,a) \leftarrow W_{v}(s,a) + v_{\theta}(s') \\ & Q(s,a) = (1-\lambda) \frac{W_{v}(s,a)}{N_{v}(s,a)}+\lambda \frac{W_{r}(s,a)}{N_{r}(s,a)} \\