AlphaGo	in	Depth		
by	Mark	Chang	
1
Overview	
•  AI	in	Game	Playing	
•  Machine	Learning	and	Deep	Learning	
•  Reinforcement	Learning		
•  AlphaGo's	Methods		
2
AI	in	Game	Playing	
•  AI	in	Game	Playing	
– Adversarial	Search	–	MiniMax		
– Monte	Carlo	Tree	Search	
– Multi-Armed Bandit Problem
3
Game	Playing	
4
Adversarial	Search	–	MiniMax		
[Figure: a game tree whose leaf positions are scored -1, 0, or 1]
5
Adversarial	Search	–	MiniMax		
[Figure: the minimax values (-1, 0, 1) are backed up from the leaves to the root, alternating between the maximizing and minimizing player]
6
Monte	Carlo	Tree	Search	
[Figure: the upper levels of the game tree are explored by tree search; the positions below are evaluated by Monte Carlo random playouts]
7
Monte	Carlo	Tree	Search	
•  Tree	Search	+	Monte	Carlo	Method		
– Selection
– Expansion
– Simulation
– Back-Propagation
[Figure: a search tree where each node stores white wins / total simulations, e.g. 3/5 at the root]
8
Selection
[Figure: starting from the root (3/5), a path is selected down the tree according to the node statistics]
9
Expansion	
[Figure: a new child node with statistics 0/0 is added under the selected leaf]
10
Simulation
[Figure: from the new node (0/0), a random playout is run to the end of the game]
11
Back-Propagation
[Figure: the playout result is propagated back up the selected path, updating each visited node's statistics (the root becomes 4/6)]
12
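The four steps above can be sketched as a small Python loop. This is a minimal illustration, not AlphaGo's implementation: the Game interface (legal_moves, play, result) is hypothetical, and selection here simply follows the highest win ratio rather than the UCB1 rule introduced on the next slides.

```python
# Minimal MCTS sketch: Selection, Expansion, Simulation, Back-Propagation.
import random

class Node:
    def __init__(self, game, parent=None):
        self.game, self.parent = game, parent
        self.children = {}              # move -> child Node
        self.wins, self.visits = 0, 0   # "white wins / total" statistics

def select(node):
    # Selection: descend while every legal move already has a child node.
    while node.children and len(node.children) == len(node.game.legal_moves()):
        node = max(node.children.values(),
                   key=lambda c: c.wins / c.visits if c.visits else float("inf"))
    return node

def expand(node):
    # Expansion: add one untried move as a new child with statistics 0/0.
    untried = [m for m in node.game.legal_moves() if m not in node.children]
    if not untried:
        return node
    move = random.choice(untried)
    child = Node(node.game.play(move), parent=node)
    node.children[move] = child
    return child

def simulate(game):
    # Simulation: play random moves to the end of the game, return the result.
    while game.legal_moves():
        game = game.play(random.choice(game.legal_moves()))
    return game.result()                # e.g. 1 if white wins, 0 otherwise

def backpropagate(node, result):
    # Back-Propagation: update wins/total along the path back to the root.
    while node is not None:
        node.visits += 1
        node.wins += result
        node = node.parent

def mcts(root, iterations=1000):
    for _ in range(iterations):
        leaf = expand(select(root))
        backpropagate(leaf, simulate(leaf.game))
```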
Multi-Armed Bandit Problem
•  Exploration vs. Exploitation
[Figure: a row of slot machines with different observed payouts (e.g. 6, 7, 8); the player must trade off pulling the best-known machine against trying the others]
13
UCB1	algorithm		
Choose the machine i that maximizes:
argmax_i ( x̄_i + sqrt( 2·log n / n_i ) )
•  x̄_i : the mean payout of machine i
•  n_i : the number of plays of machine i
•  n : the total number of plays
[Figure: the same slot machines, with each machine now scored by the UCB1 formula]
14
UCB1	algorithm		
R(s, a) = Q(s, a) + c · sqrt( log N(s) / N(s, a) )
a* = argmax_a R(s, a)
[Figure: the tree from the MCTS example (root 3/5); for the edge with statistics 2/3:
Q(s, a) = 2/3, N(s, a) = 3, N(s) = 5, c = constant]
15
UCB1	algorithm		
[Figure: the root s has statistics 3/5, with child edges a1 (2/3) and a2 (1/2)]
R(s, a1) = 2/3 + 0.5 · sqrt( log 5 / 3 ) = 1.0329
R(s, a2) = 1/2 + 0.5 · sqrt( log 5 / 2 ) = 0.9485
R(s, a1) > R(s, a2)
a* = argmax_a R(s, a) = a1
16
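The two scores above can be checked directly; a small sketch, assuming the statistics from the figure and c = 0.5 with the natural logarithm:

```python
# UCB1 score R(s, a) = Q(s, a) + c * sqrt(log N(s) / N(s, a)) for the two edges above.
import math

def ucb1(q, n_parent, n_edge, c=0.5):
    return q + c * math.sqrt(math.log(n_parent) / n_edge)

r_a1 = ucb1(q=2/3, n_parent=5, n_edge=3)   # ~1.0329
r_a2 = ucb1(q=1/2, n_parent=5, n_edge=2)   # ~0.9485
best = "a1" if r_a1 > r_a2 else "a2"       # a1 is selected
```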
Machine	Learning	and	Deep	Learning	
•  Machine	Learning	and	Deep	Learning	:	
– Supervised	Machine	Learning		
– Neural	Networks		
– Convolutional Neural Networks
– Training	Neural	Networks	
17
Supervised	Machine	Learning		
Machine		
Learning	
Model	
Problem: Solution: Output:
Problem:	 Output:	
Machine		
Learning	
Model	
Feedback	
18
Supervised	Machine	Learning		
Classification Regression
Class	A	 Class	B	
Score	=	1	
Score	=	?	
Score	=	3		
19	
Class	?
Neural	Networks	
[Figure: a neuron n with inputs x1, x2, weights w1, w2, and a bias weight w_b]
n_in = w1·x1 + w2·x2 + w_b
Sigmoid: n_out = 1 / (1 + e^(-n_in))
tanh: n_out = (1 - e^(-2·n_in)) / (1 + e^(-2·n_in))
ReLU: n_out = n_in if n_in > 0, 0 otherwise
20
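For reference, the three activation functions above in NumPy (a sketch; n_in is the weighted sum w1·x1 + w2·x2 + w_b):

```python
# Activation functions applied to the neuron input n_in.
import numpy as np

def sigmoid(n_in):
    return 1.0 / (1.0 + np.exp(-n_in))

def tanh(n_in):
    return (1.0 - np.exp(-2 * n_in)) / (1.0 + np.exp(-2 * n_in))  # equals np.tanh(n_in)

def relu(n_in):
    return np.where(n_in > 0, n_in, 0.0)
```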
Neural	Networks	
•  AND	Gate	
x1 | x2 | y
 0 |  0 | 0
 0 |  1 | 0
 1 |  0 | 0
 1 |  1 | 1
[Figure: a single neuron with weights 20, 20 and bias -30 separates (1,1) from (0,0), (0,1), (1,0)]
y = 1 / (1 + e^-(20·x1 + 20·x2 - 30))
Decision boundary: 20·x1 + 20·x2 - 30 = 0
21
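Plugging the weights above into a sigmoid neuron reproduces the AND truth table (a sketch; any sufficiently large weights with this sign pattern behave the same way):

```python
# AND gate as a single sigmoid neuron: y = sigmoid(20*x1 + 20*x2 - 30).
import math

def and_neuron(x1, x2, w1=20, w2=20, wb=-30):
    return 1.0 / (1.0 + math.exp(-(w1 * x1 + w2 * x2 + wb)))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(and_neuron(x1, x2)))   # 0, 0, 0, 1
```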
Neural	Networks	
[Figure: a fully-connected network: input layer (x, y, bias b), hidden layer (n11, n12, bias b), output layer (n21, n22) producing z1, z2; each connection carries a weight such as w11,x or w21,11]
22
Neural	Networks	
•  XOR	Gate	
[Figure: XOR built from two hidden sigmoid neurons and one output neuron:
n1 = sigmoid(20·x1 + 20·x2 - 30) (an AND gate)
n2 = sigmoid(20·x1 + 20·x2 - 10) (an OR gate)
y = sigmoid(-20·n1 + 20·n2 - 10)]
x1 | x2 | n1 | n2 | y
 0 |  0 |  0 |  0 | 0
 0 |  1 |  0 |  1 | 1
 1 |  0 |  0 |  1 | 1
 1 |  1 |  1 |  1 | 0
23
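The same check for the two-layer XOR network, using the weights from the figure:

```python
# XOR from two hidden sigmoid neurons (AND, OR) and one output neuron.
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor(x1, x2):
    n1 = sig(20 * x1 + 20 * x2 - 30)    # AND
    n2 = sig(20 * x1 + 20 * x2 - 10)    # OR
    return sig(-20 * n1 + 20 * n2 - 10)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xor(x1, x2)))   # 0, 1, 1, 0
```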
MulI-Class	ClassificaIon	
•  SoftMax
n1_out = e^(n1_in) / ( e^(n1_in) + e^(n2_in) + e^(n3_in) )
n2_out = e^(n2_in) / ( e^(n1_in) + e^(n2_in) + e^(n3_in) )
n3_out = e^(n3_in) / ( e^(n1_in) + e^(n2_in) + e^(n3_in) )
24
Multi-Class Classification
•  SoftMax
n1_out = e^(n1_in) / ( e^(n1_in) + e^(n2_in) + e^(n3_in) )
n1_out ≈ 1 when n1_in ≫ n2_in and n1_in ≫ n3_in
n1_out ≈ 0 when n1_in ≪ n2_in or n1_in ≪ n3_in
25
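A NumPy sketch of this softmax output layer (subtracting the maximum is a standard numerical-stability trick, not part of the slide):

```python
# Softmax over the pre-activations n_in, producing the outputs n_out.
import numpy as np

def softmax(n_in):
    e = np.exp(n_in - np.max(n_in))   # shift for numerical stability
    return e / e.sum()

print(softmax(np.array([5.0, 1.0, 1.0])))   # first output close to 1
print(softmax(np.array([1.0, 5.0, 5.0])))   # first output close to 0
```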
Multi-Class Classification
•  One-Hot Encoding:
      Class 1 | Class 2 | Class 3
n1:      1    |    0    |    0
n2:      0    |    1    |    0
n3:      0    |    0    |    1
26
Convolutional Neural Networks
[Figure: a convolutional layer maps an input volume (width × height × depth) to an output volume; the same weights are shared across all spatial positions]
27
Convolutional Neural Networks
[Figure: each neuron in a convolutional layer is connected only to a local receptive field of the layer below; the next convolutional layer again connects to local receptive fields of the previous one]
28
Convolutional Neural Networks
[Figure: the filters in the convolutional layers are applied to the input image and produce filter responses]
29
Training	Neural	Networks	
•  One-Hot	Encoding:	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 1	 0	 0	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 1	 0	 0	
0	 0	 0	 1	 0	
0	 0	 1	 0	 0	
0	 0	 0	 0	 0	
1	 1	 1	 1	 1	
1	 1	 0	 1	 1	
1	 1	 0	 0	 1	
1	 1	 0	 1	 1	
1	 1	 1	 1	 1	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 1	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
Player’s		
stones	
Opponent’s		
stones	
Empty
positions
Next
position
Input	 Output	
30
Training	Neural	Networks	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 1	 0	 0	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 1	 0	 0	
0	 0	 0	 1	 0	
0	 0	 1	 0	 0	
0	 0	 0	 0	 0	
1	 1	 1	 1	 1	
1	 1	 0	 1	 1	
1	 1	 0	 0	 1	
1	 1	 0	 1	 1	
1	 1	 1	 1	 1	
0	 0	 0	 0	 0	
0	 .5	 0	 0	 0	
0	 .3	 0	 0	 0	
0	 .2	 0	 0	 0	
0	 0	 0	 0	 0	
Forward propagation
p_w(a|s)
Inputs:
Input
layer
Convolutional
layer
Output
layer
Outputs:
s
31
Training	Neural	Networks	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 1	 0	 0	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 1	 0	 0	
0	 0	 0	 1	 0	
0	 0	 1	 0	 0	
0	 0	 0	 0	 0	
1	 1	 1	 1	 1	
1	 1	 0	 1	 1	
1	 1	 0	 0	 1	
1	 1	 0	 1	 1	
1	 1	 1	 1	 1	
0	 0	 0	 0	 0	
0	 .5	 0	 0	 0	
0	 .3	 0	 0	 0	
0	 .2	 0	 0	 0	
0	 0	 0	 0	 0	
Inputs:		
Input		
layer	
Convolutional
layer
Output
layer
Outputs:
s
p_w(a|s)
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
0	 1	 0	 0	 0	
0	 0	 0	 0	 0	
0	 0	 0	 0	 0	
Golden action a_i:
Backward propagation
Cost function: -log( p_w(a_i|s) )
w ← w - η · ∂( -log p_w(a_i|s) ) / ∂w
32
Cost Function
[Figure: the cost -log( p_w(a_i|s) ) as a function of w: it is large when p_w(a_i|s) ≈ 0 and approaches 0 when p_w(a_i|s) ≈ 1]
33
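A tiny check of how this cost behaves at the two extremes:

```python
# Cross-entropy cost -log p: large when p is near 0, near 0 when p is near 1.
import math

for p in (0.01, 0.5, 0.99):
    print(p, -math.log(p))   # 4.61, 0.69, 0.01
```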
Gradient	Descent	
w ← w - η · ∂( -log p_w(a_i|s) ) / ∂w
η : Learning Rate
[Figure: each step moves w against the gradient of the cost -log p_w(a_i|s)]
34
Gradient	Descent	
[Figure: the output p_w(a_i|s) as a function of w for the state s, shown for the case where the golden a_i is 1 and the case where the golden a_i is 0]
35
Gradient	Descent	
36
Backward Propagation
[Figure: neuron n1 feeds neuron n2 through the weight w21; n2's output enters the cost function J]
∂J/∂w21 = ∂J/∂n2(out) · ∂n2(out)/∂n2(in) · ∂n2(in)/∂w21
w21 ← w21 - η · ∂J/∂w21
w21 ← w21 - η · ∂J/∂n2(out) · ∂n2(out)/∂n2(in) · ∂n2(in)/∂w21
37
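A numeric sketch of this chain rule for a single sigmoid neuron n2 fed by n1 through the weight w21; the squared-error cost J used here is just an example choice:

```python
# Backward propagation through one weight:
# dJ/dw21 = dJ/dn2_out * dn2_out/dn2_in * dn2_in/dw21
import math

def sig(z):
    return 1.0 / (1.0 + math.exp(-z))

n1_out, w21, target, eta = 0.8, 0.5, 1.0, 0.1
n2_in = w21 * n1_out
n2_out = sig(n2_in)
J = 0.5 * (n2_out - target) ** 2            # example cost function

dJ_dn2out    = n2_out - target
dn2out_dn2in = n2_out * (1 - n2_out)        # derivative of the sigmoid
dn2in_dw21   = n1_out
dJ_dw21 = dJ_dn2out * dn2out_dn2in * dn2in_dw21

w21 = w21 - eta * dJ_dw21                   # gradient descent step
```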
Reinforcement	Learning		
•  Reinforcement	Learning	:		
– Policy	&	Value		
– Policy	Gradient	Method	
38
Reinforcement	Learning		
Reward	(Feedback)	
Reward	(Feedback)	
white	
win	
black	
win	
39
Reinforcement	Learning		
State: S_t
Reward
(Feedback): R_t
Action: A_t
•  Feedback is delayed.
•  No supervisor, only a reward signal.
•  Rules of the game are unknown.
•  Agent's actions affect the subsequent state.
Agent	
Environment	
40
Policy
•  The behavior of an agent

Stochastic Policy:
state s, actions a1 and a2
π(a1 | s) = 0.5
π(a2 | s) = 0.5

Deterministic Policy:
state s, action a
π(s) = a
41
Value
•  The expected long-term reward

State-value Function v_π(s):
state s, policy π, rewards r until the end
Action-value Function q_π(s, a):
state s, action a, policy π, rewards r until the end
42
Policy	Gradient	Method	
•  REINFORCE
– the REward Increment = Nonnegative Factor × Offset Reinforcement × Characteristic Eligibility

w ← w + α(r - b) · ∂log π(a|s) / ∂w
α : learning rate
w : weights in the policy function
r : reward
b : baseline (usually = 0)
43
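A sketch of this update for a small linear-softmax policy; the state features, action set, and learning rate are placeholders for illustration, with the baseline b = 0 as on the slide:

```python
# REINFORCE: w <- w + alpha * (r - b) * d(log pi(a|s)) / dw
import numpy as np

def policy(w, s):
    logits = w @ s                       # one row of weights per action
    e = np.exp(logits - logits.max())
    return e / e.sum()                   # pi(.|s)

def reinforce_update(w, s, a, r, alpha=0.1, b=0.0):
    pi = policy(w, s)
    grad_log = -np.outer(pi, s)          # d(log pi(a|s))/dw, softmax part...
    grad_log[a] += s                     # ...plus the term for the action taken
    return w + alpha * (r - b) * grad_log

w = np.zeros((4, 5))                     # 4 actions, 5 state features
s = np.array([1.0, 0.0, 0.0, 1.0, 0.0])
w = reinforce_update(w, s, a=2, r=1.0)   # the sampled action becomes more likely
```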
Grid	World	Example	
4	x	4	Grid	World	
Terminal	
Reward = 1
Terminal
Reward = -1
Initial
position
Action
Agent	
44
Policy	Networks	
0	 0	 0	 0	
0	 1	 0	 0	
0	 0	 0	 0	
0	 0	 0	 0	
One-hot
encoding
Probabilities of actions
Sampling
Execute
action
Neural	
Networks	
s → π(a|s)
45
Initialization
Neural
Networks
Initialized with
random weights
π(a|s) for every s
46
Forward	PropagaIon	
Neural	
Networks	
π(a|s)
s
47
Forward	PropagaIon	
Neural	
Networks	
π(a|s)
s
Reward	r = 1
48
Backward	PropagaIon	
Neural	
Networks	
Reward	r = 1
49
w ← w + α(r - b) · ∂log π(a|s) / ∂w
r · ∂log π(a|s) / ∂w
Backward	PropagaIon	
Neural	
Networks	
Reward	r = 1
50
w ← w + α(r - b) · ∂log π(a|s) / ∂w
r · ∂log π(a|s) / ∂w
Next Iteration
Neural	
Networks	
51
Forward	PropagaIon	
Neural	
Networks	
π(a|s)
s
52
Forward	PropagaIon	
Neural	
Networks	
π(a|s)
s
Reward r = -1
53
Backward	PropagaIon	
Neural	
Networks	
Reward r = -1
54
w ← w + α(r - b) · ∂log π(a|s) / ∂w
r · ∂log π(a|s) / ∂w
Backward	PropagaIon	
Neural	
Networks	
Reward r = -1
55
w ← w + α(r - b) · ∂log π(a|s) / ∂w
r · ∂log π(a|s) / ∂w
Next Iteration …
Neural	
Networks	
56
After Several Iterations …
57
AlphaGo’s	Methods	
•  Training:	
– Supervised learning: Classification
– Reinforcement learning
– Supervised learning: Regression
•  Searching:		
– Searching	with	policy	and	value	networks	
– Distributed	search		
58
Training	
Human	expert	data	 Self-play	data	
Rollout	
policy	
SL	policy	
network	
RL	policy	
network	
Value	
network	
Classification Regression
Policy
gradient
Generate
data
Initialize
weights
p_π, p_σ, p_ρ, v_θ
59
Supervised Learning: Classification
Human expert data
Rollout
policy
SL policy
network
Classification
p_π, p_σ
KGS dataset
160,000 games
29.4 million positions
linear-softmax network
(faster but less accurate)
13-layer convolutional
neural network
50 GPUs, 3 weeks
Accuracy: 57.0%
60
Input/Output	Data	
0	 0	 0	
0	 1	 0	
0	 0	 0	
0	 1	 0	
0	 0	 1	
0	 1	 0	
1	 0	 1	
1	 0	 0	
1	 0	 1	
Input	
Next
position
Output	
0	 0	 0	
1	 0	 0	
0	 0	 0	
Stone	color:	
3	planes	
player,	opponent,	empty	
Liberty:	
8	planes	
1~8 liberties
Stone	color,	Liberty,	Turns	
since,	Capture	size,	
Self-atari	size,	Ladder	capture,	
Ladder	escape,	Sensibleness.	
Total:	48	planes	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 1	 0	
0	 0	 0	
0	 1	 0	
0	 0	 1	
0	 1	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
0	 0	 0	
61
Symmetries	
62	
[Figure: the eight symmetries of a position: the input, its rotations by 90, 180, and 270 degrees, and the vertical reflection of each]
SL	Policy	Network	
Input	
Size:	19x19	
48	planes	
First	layer	
Conv+ReLU	
Kernel	size:	5x5	
k	filters		
2nd	to	12th	layers	
Conv+ReLU	
Kernel	size:	3x3	
k	filters		
13th layer
Kernel size: 1x1
1 filter
Softmax
k = 192
63
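A rough PyTorch sketch of this architecture; the padding choices and the final softmax over the 19×19 board follow the description above and are assumptions, not released code:

```python
# SL policy network sketch: 48 input planes -> 12 conv+ReLU layers -> 1x1 conv -> softmax.
import torch
import torch.nn as nn

k = 192
layers = [nn.Conv2d(48, k, kernel_size=5, padding=2), nn.ReLU()]     # 1st layer, 5x5 kernels
for _ in range(11):                                                  # 2nd to 12th layers, 3x3 kernels
    layers += [nn.Conv2d(k, k, kernel_size=3, padding=1), nn.ReLU()]
layers += [nn.Conv2d(k, 1, kernel_size=1)]                           # 13th layer, 1x1 kernel, 1 filter

policy_net = nn.Sequential(*layers)

board = torch.zeros(1, 48, 19, 19)          # one position, 48 feature planes
logits = policy_net(board).view(1, -1)      # flatten the 19x19 output map
p = torch.softmax(logits, dim=1)            # probability for each of the 361 points
```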
Supervised Learning: Classification
Input: s
SL Policy network: p_σ
Probabilities of actions: p_σ(a|s)
Golden action: a
Backward propagation (learning rate η):
σ ← σ - η · ∂( -log p_σ(a|s) ) / ∂σ
64
Reinforcement	Learning		
Self-play	data	
RL	policy	
network	
Policy	
gradient	
p⇢
50 GPUs, 1 day
won 80% of games against the SL policy network
10,000 x 128 games
Weights initialized
by the SL policy network
65
Reinforcement Learning
[Figure: the RL policy network p_ρ is initialized with the SL policy network p_σ's weights (ρ = σ). It plays games against an opponent drawn from a pool of earlier versions of p_ρ; when a game ends, the reward r is used by the policy gradient method to update p_ρ, and the updated p_ρ is added to the opponent pool]
66
Policy	Gradient	Method	
[Figure: a self-play game: p_ρ selects actions with probabilities p_ρ(a1|s1), p_ρ(a2|s2), ..., p_ρ(aT|sT) over states s1, s2, ..., sT, and the reward r(sT) is observed at the terminal state]
Backward propagation (learning rate α, baseline b):
ρ ← ρ + α · Σ_{i=1..T} ∂log p_ρ(a_i|s_i)/∂ρ · ( r(s_T) - b(s_t) )
67
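A sketch of this update, assuming the per-step gradients of log p_ρ(a_i|s_i) with respect to ρ have already been computed by the network library, and treating the baseline as a constant:

```python
# rho <- rho + alpha * sum_i d(log p_rho(a_i|s_i))/d(rho) * (r(s_T) - b)
def policy_gradient_update(rho, grad_logps, final_reward, alpha=0.01, b=0.0):
    # grad_logps: one gradient array per time step of the self-play game
    total = sum(grad_logps)
    return rho + alpha * (final_reward - b) * total
```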
Supervised learning: Regression
Self-play	data	
RL	policy	
network	
Value	
network	
Regression	
Generate	
data	
p⇢
v✓
30 million positions
50 GPUs, 1 week
MSE: 0.226
initialize
weights
15-layer convolutional
neural network
68
Value	Network	
Input	
Size:	19x19	
48	planes	
14th layer
Fully-connected
256 ReLU units
1st ~ 13th layers:
the same as the
policy network
15th	layer	
Fully-connected	
1	tanh	unit	
+1	unit		
(current	color)	
69
Input/Output Data
1. Randomly sample an integer U in 1~450.
2. [Figure: moves t = 1 ... U-1 are played by the SL policy network p_σ; a random action is played at t = U; moves t = U+1 ... end are played by the RL policy network p_ρ]
3. Generate a training example (s_{U+1}, z_{U+1}):
state: s_{U+1}, value: z_{U+1} = reward r at the end of the game
70
Supervised learning: Regression
Input: s
Value network: v_θ
Output value: v_θ(s)
Golden value: z
Backward propagation:
θ ← θ + η · (z - v_θ(s)) · ∂v_θ(s)/∂θ
71
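This is the gradient step for the squared error (z - v_θ(s))²/2; a sketch, assuming v_θ(s) and its gradient with respect to θ are provided by the network:

```python
# theta <- theta + eta * (z - v_theta(s)) * d(v_theta(s))/d(theta)
def value_regression_update(theta, v_of_s, grad_v, z, eta=0.01):
    return theta + eta * (z - v_of_s) * grad_v
```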
Searching	
Selection    Expansion    Evaluation    Backup
[Figure: selection follows max(Q + u) down the tree; expansion adds a node with priors from p_σ; evaluation combines the value network v_θ with a p_π rollout reward r; backup updates the statistics along the visited path]
72
Searching
•  Each edge (s, a) stores a set of statistics:
•  Q(s, a) : combined mean action value
•  P(s, a) : prior probability evaluated by p_σ(a|s)
•  W_v(s, a) : estimated action value from v_θ(s)
•  W_r(s, a) : estimated action value from rollouts with p_π(a|s)
•  N_v(s, a) : count of evaluations by v_θ(s)
•  N_r(s, a) : count of evaluations by p_π(a|s)
73
Selection
PUCT Algorithm: choose action a* = argmax_a ( Q(s, a) + u(s, a) )
u(s, a) = c · P(s, a) · sqrt( Σ_b N_r(s, b) ) / ( 1 + N_r(s, a) )
Q(s, a) : exploitation term
u(s, a) : exploration term
Σ_b N_r(s, b) : visit count of the parent node s
N_r(s, a) : visit count of the edge (s, a)
c : level of exploration
74
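A sketch of this selection rule over one node's edge statistics; the dict-of-edges layout is an illustrative assumption, not AlphaGo's data structure:

```python
# a* = argmax_a ( Q(s,a) + c * P(s,a) * sqrt(sum_b N_r(s,b)) / (1 + N_r(s,a)) )
import math

def select_action(edges, c=5.0):
    # edges: dict mapping action a -> {'Q': ..., 'P': ..., 'Nr': ...} for one state s
    total_visits = sum(e['Nr'] for e in edges.values())
    def score(e):
        u = c * e['P'] * math.sqrt(total_visits) / (1 + e['Nr'])
        return e['Q'] + u
    return max(edges, key=lambda a: score(edges[a]))
```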
Expansion
1. If the visit count exceeds a threshold, N_r(s, a) > n_thr,
insert the node for the successor state s'.
2. For every possible action a', initialize the statistics:
N_v(s', a') = N_r(s', a') = 0
W_r(s', a') = W_v(s', a') = 0
P(s', a') = p_σ(a'|s')
75
Evaluation
1. Evaluate v_θ(s') with the value network v_θ.
2. Simulate the game from s' with the rollout policy network p_π.
When reaching the terminal state s_T, calculate the reward r(s_T).
76
Backup
Update the statistics of every visited edge (s, a):
N_r(s, a) ← N_r(s, a) + 1
W_r(s, a) ← W_r(s, a) + r(s_T)
N_v(s, a) ← N_v(s, a) + 1
W_v(s, a) ← W_v(s, a) + v_θ(s')
Q(s, a) = (1 - λ) · W_v(s, a) / N_v(s, a) + λ · W_r(s, a) / N_r(s, a)
λ : interpolation constant
77
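A sketch of this backup for one visited edge, using the same dict layout as the selection sketch above:

```python
# Update counts, totals, and the combined value Q for one visited edge (s, a).
def backup_edge(edge, rollout_reward, value_estimate, lam=0.5):
    edge['Nr'] += 1
    edge['Wr'] += rollout_reward        # r(s_T) from the rollout
    edge['Nv'] += 1
    edge['Wv'] += value_estimate        # v_theta(s') from the value network
    edge['Q'] = ((1 - lam) * edge['Wv'] / edge['Nv']
                 + lam * edge['Wr'] / edge['Nr'])
    return edge
```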
Distributed	Search		
[Figure: p_σ(a'|s') and v_θ(s') are evaluated by the policy and value networks; p_π rollouts produce r(s_T)]
Main	search	tree	
Master	CPU	
Policy	&	value	networks	
176	GPUs	
Rollout	policy	networks	
1,202	CPUs		
78
Reference	
•  Mastering	the	game	of	Go	with	deep	neural	
networks	and	tree	search	
– http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html
79
Further	Reading	
•  Monte Carlo Tree Search
– https://jeffbradberry.com/posts/2015/09/intro-to-monte-carlo-tree-search/
•  Neural Networks Backward Propagation
– http://cpmarkchang.logdown.com/posts/277349-neural-network-backward-propagation
•  Convolutional Neural Networks
– http://cs231n.github.io/convolutional-networks/
•  Policy Gradient Method: REINFORCE
– https://www.cs.cmu.edu/afs/cs/project/jair/pub/volume4/kaelbling96a-html/node37.html
80
About	the	Speaker	
•  Email:	ckmarkoh	at	gmail	dot	com	
•  Blog: http://cpmarkchang.logdown.com
•  Github: https://github.com/ckmarkoh
F.C.C	
Mark	Chang	
•  Facebook: https://www.facebook.com/ckmarkoh.chang
•  Slideshare: http://www.slideshare.net/ckmarkohchang
•  Linkedin: https://www.linkedin.com/pub/mark-chang/85/25b/847
81

  26. & N_{r}(s,a) \leftarrow N_{r}(s,a) +1 \\ & W_{r}(s,a) \leftarrow W_{r}(s,a) + r(s_{T}) \\ & N_{v}(s,a) \leftarrow N_{v}(s,a) +1 \\ & W_{v}(s,a) \leftarrow W_{v}(s,a) + v_{\theta}(s') \\ & Q(s,a) = (1-\lambda) \frac{W_{v}(s,a)}{N_{v}(s,a)}+\lambda \frac{W_{r}(s,a)}{N_{r}(s,a)} \\