AlphaGo in Depth

14. UCB1 algorithm
Select the machine i that maximizes:

\underset{i}{\mathrm{argmax}} \left( \bar{x}_i + \sqrt{\frac{2 \log n}{n_i}} \right)

• \bar{x}_i: the mean payout of machine i
• n_i: the number of plays of machine i
• n: the total number of plays
(Figure: repeated Go-board diagrams, columns F–H, rows 6–8.)
15. UCB1 algorithm

R(s, a) = Q(s, a) + c \sqrt{\frac{\log N(s)}{N(s, a)}}

a^{*} = \underset{a}{\mathrm{argmax}}\; R(s, a)

c = constant

(Figure: a search tree of win/visit counts — root 3/5; children 1/2 and 2/3; grandchildren 0/1, 1/2, 1/1, 1/1 and 1/1, 0/1.) For the edge leading to the 2/3 child: Q(s, a) = 2/3, N(s, a) = 3, N(s) = 5.
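A minimal numeric sketch of this selection rule in plain Python (natural log and c = 0.5 are assumptions of this sketch, chosen to match the 2/3 vs. 1/2 example above):

```python
import math

def ucb1(q, n_edge, n_parent, c=0.5):
    """R(s, a) = Q(s, a) + c * sqrt(log N(s) / N(s, a))."""
    return q + c * math.sqrt(math.log(n_parent) / n_edge)

# The two actions under the root: Q = 2/3 with 3 visits vs. Q = 1/2 with 2 visits.
r1 = ucb1(2 / 3, 3, 5)
r2 = ucb1(1 / 2, 2, 5)
best = "a1" if r1 > r2 else "a2"
```

The less-visited action gets the larger exploration bonus, but here the higher mean payout still dominates, so a1 is selected.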
21. Neural Networks

• AND Gate

x1  x2 | y
 0   0 | 0
 0   1 | 0
 1   0 | 0
 1   1 | 1

A single neuron n with bias b = -30 and weights 20, 20 on inputs x1, x2 computes:

y = \frac{1}{1 + e^{-(20x_1 + 20x_2 - 30)}}

The decision boundary 20x_1 + 20x_2 - 30 = 0 separates (1,1) (output 1) from (0,0), (0,1), (1,0) (output 0).
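The AND neuron above can be checked directly in plain Python:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def and_gate(x1, x2):
    # Weights 20, 20 and bias -30, as on the slide.
    return sigmoid(20 * x1 + 20 * x2 - 30)

outputs = [round(and_gate(x1, x2)) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

The large weights push the sigmoid into its saturated regions, so the outputs are numerically very close to the exact truth-table values.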
22. Neural Networks

(Figure: a two-layer network. Input layer: x, y, and bias b. Hidden layer: n11, n12, with weights W11,x, W11,y, W11,b and W12,x, W12,y, W12,b. Output layer: n21, n22, producing z1, z2, with weights W21,11, W21,12, W21,b and W22,11, W22,12, W22,b.)
23. Neural Networks

• XOR Gate

XOR is not linearly separable: no single line separates {(0,1), (1,0)} (output 1) from {(0,0), (1,1)} (output 0), so a single neuron is not enough. It can instead be built from two hidden neurons and an output neuron:

n1 (AND): weights 20, 20 on x1, x2; bias -30
n2 (OR): weights 20, 20 on x1, x2; bias -10
y: weights -20 on n1, 20 on n2; bias -10

x1  x2 | n1  n2 | y
 0   0 |  0   0 | 0
 0   1 |  0   1 | 1
 1   0 |  0   1 | 1
 1   1 |  1   1 | 0
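The XOR construction above can be verified numerically in plain Python:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_gate(x1, x2):
    n1 = sigmoid(20 * x1 + 20 * x2 - 30)     # AND
    n2 = sigmoid(20 * x1 + 20 * x2 - 10)     # OR
    return sigmoid(-20 * n1 + 20 * n2 - 10)  # OR and not AND

outputs = [round(xor_gate(x1, x2)) for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```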
24. Multi-Class Classification

• Softmax

Given inputs n_{1,in}, n_{2,in}, n_{3,in} to neurons n1, n2, n3:

n_{1,out} = \frac{e^{n_{1,in}}}{e^{n_{1,in}} + e^{n_{2,in}} + e^{n_{3,in}}}

n_{2,out} = \frac{e^{n_{2,in}}}{e^{n_{1,in}} + e^{n_{2,in}} + e^{n_{3,in}}}

n_{3,out} = \frac{e^{n_{3,in}}}{e^{n_{1,in}} + e^{n_{2,in}} + e^{n_{3,in}}}
25. Multi-Class Classification

• Softmax

n_{1,out} = \frac{e^{n_{1,in}}}{e^{n_{1,in}} + e^{n_{2,in}} + e^{n_{3,in}}}

n_{1,out} \approx 1 when n_{1,in} \gg n_{2,in} and n_{1,in} \gg n_{3,in}

n_{1,out} \approx 0 when n_{1,in} \ll n_{2,in} or n_{1,in} \ll n_{3,in}
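This saturation behavior can be checked numerically (plain Python; the input values are illustrative):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

dominant = softmax([10.0, 0.0, 0.0])    # n_1,in much larger  -> n_1,out near 1
dominated = softmax([0.0, 10.0, 10.0])  # n_1,in much smaller -> n_1,out near 0
```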
30. Training Neural Networks

• One-Hot Encoding:

Input (three 5×5 planes):

Player's stones:
0 0 0 0 0
0 0 0 0 0
0 0 1 0 0
0 0 0 0 0
0 0 0 0 0

Opponent's stones:
0 0 0 0 0
0 0 1 0 0
0 0 0 1 0
0 0 1 0 0
0 0 0 0 0

Empty positions:
1 1 1 1 1
1 1 0 1 1
1 1 0 0 1
1 1 0 1 1
1 1 1 1 1

Output (next position, one-hot):
0 0 0 0 0
0 0 0 0 0
0 1 0 0 0
0 0 0 0 0
0 0 0 0 0
31. Training Neural Networks

Forward propagation

Inputs s: the three 5×5 planes from the one-hot encoding (player's stones, opponent's stones, empty positions).
Input layer → Convolutional layer → Output layer
Outputs p_w(a|s):
0  0  0 0 0
0 .5  0 0 0
0 .3  0 0 0
0 .2  0 0 0
0  0  0 0 0
32. Training Neural Networks

Backward propagation

Inputs s: the three planes as before; outputs p_w(a|s) from the forward pass.
Golden (target) next position a_i:
0 0 0 0 0
0 0 0 0 0
0 1 0 0 0
0 0 0 0 0
0 0 0 0 0

Cost function: -\log(p_w(a_i|s))

w = w - \eta \frac{\partial (-\log(p_w(a_i|s)))}{\partial w}
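One such update step can be sketched in plain Python, assuming a softmax output layer (for a softmax, the gradient of -log p(a_i) with respect to logit j is p_j - 1[j = a_i]; the learning rate and step count here are illustrative):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sgd_step_on_logits(logits, target, eta=0.5):
    """One gradient-descent step on -log p(target) w.r.t. the logits."""
    p = softmax(logits)
    grad = [p[j] - (1.0 if j == target else 0.0) for j in range(len(p))]
    return [z - eta * g for z, g in zip(logits, grad)]

logits = [0.0, 0.0, 0.0]
for _ in range(50):
    logits = sgd_step_on_logits(logits, target=1)
p = softmax(logits)
```

Repeated steps drive the probability of the golden action toward 1.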
45. Policy Networks

One-hot encoding of the state s (4×4 plane):
0 0 0 0
0 1 0 0
0 0 0 0
0 0 0 0

Neural networks map s to probabilities of actions \pi(a|s); an action is drawn by sampling and then executed.
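Sampling an action from the policy's output distribution can be sketched with the standard library (the probabilities below are illustrative):

```python
import random

def sample_action(probs):
    """Draw one action index a ~ pi(a|s)."""
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

random.seed(0)
counts = [0, 0, 0]
for _ in range(1000):
    counts[sample_action([0.5, 0.3, 0.2])] += 1
```

Over many draws the empirical frequencies track pi(a|s), so higher-probability moves are executed more often.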
61. Input/Output Data

Input (example 3×3 planes):

Player's stones:
0 0 0
0 1 0
0 0 0

Opponent's stones:
0 1 0
0 0 1
0 1 0

Empty positions:
1 0 1
1 0 0
1 0 1

Output (next position):
0 0 0
1 0 0
0 0 0

Feature planes:
• Stone color: 3 planes (player, opponent, empty)
• Liberty: 8 planes (1~8 liberties)
• Full set: stone color, liberty, turns since, capture size, self-atari size, ladder capture, ladder escape, sensibleness. Total: 48 planes.
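Building the three stone-color planes from a board can be sketched as follows (plain Python; the board encoding 1 = player, -1 = opponent, 0 = empty is an assumption of this sketch):

```python
def stone_color_planes(board):
    """Split a board into player / opponent / empty binary planes."""
    player = [[1 if c == 1 else 0 for c in row] for row in board]
    opponent = [[1 if c == -1 else 0 for c in row] for row in board]
    empty = [[1 if c == 0 else 0 for c in row] for row in board]
    return player, opponent, empty

# The 3x3 example position from the slide.
board = [[0, -1, 0],
         [0, 1, -1],
         [0, -1, 0]]
player, opponent, empty = stone_color_planes(board)
```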
67. Policy Gradient Method

Play a game with the policy network p_\rho: at each step the network samples an action, giving p_\rho(a_1|s_1), p_\rho(a_2|s_2), ..., p_\rho(a_T|s_T), and the terminal state s_T yields the reward r(s_T).

Update (learning rate \alpha, reward r(s_T), baseline b(s_t)), applied by backward propagation:

\rho \leftarrow \rho + \alpha \sum_{i=1}^{T} \frac{\partial \log p_\rho(a_i|s_i)}{\partial \rho} (r(s_T) - b(s_t))
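A toy sketch of this update for a one-state, two-action softmax policy (plain Python; the reward scheme, baseline, and learning rate are illustrative assumptions):

```python
import math
import random

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(rho, action, reward, baseline, alpha=0.1):
    """rho += alpha * d log p(action)/d rho * (reward - baseline)."""
    p = softmax(rho)
    grad = [-p[j] for j in range(len(rho))]
    grad[action] += 1.0  # d log softmax / d logit_j = 1[j = a] - p_j
    return [r + alpha * g * (reward - baseline) for r, g in zip(rho, grad)]

random.seed(0)
rho = [0.0, 0.0]
for _ in range(500):
    a = random.choices([0, 1], weights=softmax(rho))[0]
    reward = 1.0 if a == 0 else 0.0  # action 0 always wins in this toy setup
    rho = reinforce_step(rho, a, reward, baseline=0.5)
p = softmax(rho)
```

Actions whose reward exceeds the baseline are reinforced, so the policy concentrates on the winning action.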
74. Selection

PUCT Algorithm: at state s, choose action a^*

a^{*} = \underset{a}{\mathrm{argmax}} (Q(s, a) + u(s, a))

u(s, a) = cP(s, a) \frac{\sqrt{\sum_b N_r(s, b)}}{1 + N_r(s, a)}

• Q(s, a): exploitation term
• u(s, a): exploration term
• \sum_b N_r(s, b): visit counts of parent node s
• N_r(s, a): visit counts of edge (s, a)
• c: level of exploration
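A minimal sketch of this selection rule (plain Python; the edge statistics and c = 1.0 are illustrative assumptions):

```python
import math

def puct_u(prior, n_edge, n_parent, c=1.0):
    """u(s, a) = c * P(s, a) * sqrt(sum_b N_r(s, b)) / (1 + N_r(s, a))."""
    return c * prior * math.sqrt(n_parent) / (1 + n_edge)

def select_action(edges):
    """edges: list of (name, Q, P, N_r) tuples for the edges of node s."""
    n_parent = sum(e[3] for e in edges)
    return max(edges, key=lambda e: e[1] + puct_u(e[2], e[3], n_parent))[0]

# a1 has the higher Q; a2 has a strong prior P but few visits.
early = select_action([("a1", 0.6, 0.2, 30), ("a2", 0.3, 0.7, 2)])
later = select_action([("a1", 0.6, 0.2, 30), ("a2", 0.3, 0.7, 50)])
```

Early on, the prior-driven exploration term favors the barely-visited a2; as its visit count grows the bonus decays and selection falls back to the higher-Q a1.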
77. Backup

After a simulation through edge (s, a) yields the rollout reward r(s_T) and the value-network evaluation v_\theta(s'), update the statistics of every visited edge (s, a):

N_r(s, a) \leftarrow N_r(s, a) + 1
W_r(s, a) \leftarrow W_r(s, a) + r(s_T)
N_v(s, a) \leftarrow N_v(s, a) + 1
W_v(s, a) \leftarrow W_v(s, a) + v_\theta(s')

Q(s, a) = (1 - \lambda) \frac{W_v(s, a)}{N_v(s, a)} + \lambda \frac{W_r(s, a)}{N_r(s, a)}

• \lambda: interpolation constant
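These backup statistics can be sketched as a small edge record (plain Python; the class name, sample values, and lambda = 0.5 are illustrative assumptions):

```python
class Edge:
    """Statistics kept for one tree edge (s, a)."""

    def __init__(self):
        self.n_r = 0    # rollout visit count N_r
        self.w_r = 0.0  # accumulated rollout reward W_r
        self.n_v = 0    # value-network visit count N_v
        self.w_v = 0.0  # accumulated value evaluations W_v

    def backup(self, rollout_reward, value_estimate):
        self.n_r += 1
        self.w_r += rollout_reward
        self.n_v += 1
        self.w_v += value_estimate

    def q(self, lam=0.5):
        # Q = (1 - lambda) * Wv/Nv + lambda * Wr/Nr
        return (1 - lam) * self.w_v / self.n_v + lam * self.w_r / self.n_r

edge = Edge()
edge.backup(rollout_reward=1.0, value_estimate=0.6)
edge.backup(rollout_reward=0.0, value_estimate=0.4)
```

With lambda = 0.5 the two simulations above average to Q = 0.5, an equal blend of the rollout mean and the value-network mean.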