25. Learning V(I, a|θ_p)
■ Loss function
$L(\theta_p) = \mathbb{E}_{(I,\, t',\, \tilde{r}^{t'}) \sim M_{V,p}} \Big[\, t' \sum_{a} \big( \tilde{r}^{t'}(a) - V(I, a \,|\, \theta_p) \big)^{2} \Big]$
– M_{V,p} : memory that stores player p's regrets
■ When capacity is exceeded, entries are overwritten via reservoir sampling
– At every iteration the network weights are randomly re-initialized and the network is trained from scratch
※ The average of the regrets $\tilde{r}^{t'}(I, a)$ over the iterations so far is proportional to the total regret $R^t(I, a)$, so old data must be kept as well; hence reservoir sampling
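Reservoir sampling is what lets a fixed-size memory retain a uniform random sample over all iterations. A minimal sketch in Python (the class name ReservoirMemory is mine, not from the paper):

```python
import random

class ReservoirMemory:
    """Fixed-capacity buffer that keeps a uniform random sample of all items ever added."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = []
        self.num_seen = 0  # how many items have been offered so far

    def add(self, item):
        self.num_seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Replace a random slot with probability capacity / num_seen,
            # so every item seen so far survives with equal probability.
            j = random.randrange(self.num_seen)
            if j < self.capacity:
                self.items[j] = item
```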
26. Learning Π(I, a|θ_Π)
■ Approximates the average strategy $\bar{\sigma}^T(I)$
※ In CFR, the average strategy converges to a Nash equilibrium strategy
■ Loss function
$L(\theta_\Pi) = \mathbb{E}_{(I,\, t',\, \sigma^{t'}) \sim M_\Pi} \Big[\, t' \sum_{a} \big( \sigma^{t'}(a) - \Pi(I, a \,|\, \theta_\Pi) \big)^{2} \Big]$
– M_Π : memory that stores each player's strategies
■ When capacity is exceeded, entries are overwritten via reservoir sampling
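Both losses are iteration-weighted squared errors over samples drawn from the memories. A hedged PyTorch sketch of such a loss (the function and tensor names are illustrative, not from the paper):

```python
import torch

def weighted_mse_loss(preds, targets, iters):
    """Iteration-weighted squared error, as in the losses for V and Pi above.

    preds:   (batch, num_actions) network outputs, e.g. V(I, a | theta_p)
    targets: (batch, num_actions) stored regrets r~ (or strategies sigma)
    iters:   (batch,) the CFR iteration t' at which each sample was stored
    """
    per_sample = ((targets - preds) ** 2).sum(dim=1)  # sum over actions a
    return (iters * per_sample).mean()                # weight by t', average over the memory
```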
27. Algorithm

Algorithm 1 Deep Counterfactual Regret Minimization
function DEEPCFR
    Initialize each player's advantage network V(I, a|θ_p) with parameters θ_p so that it returns 0 for all inputs.
    Initialize reservoir-sampled advantage memories M_{V,1}, M_{V,2} and strategy memory M_Π.
    for CFR iteration t = 1 to T do
        for each player p do
            for traversal k = 1 to K do
                TRAVERSE(∅, p, θ_1, θ_2, M_{V,p}, M_Π)    ▷ Collect data from a game traversal with external sampling
            Train θ_p from scratch on loss L(θ_p) = E_{(I, t', r̃^{t'}) ∼ M_{V,p}}[ t' Σ_a ( r̃^{t'}(a) − V(I, a|θ_p) )² ]
    Train θ_Π on loss L(θ_Π) = E_{(I, t', σ^{t'}) ∼ M_Π}[ t' Σ_a ( σ^{t'}(a) − Π(I, a|θ_Π) )² ]
    return θ_Π

Algorithm 2 CFR Traversal with External Sampling
function TRAVERSE(h, p, θ_1, θ_2, M_V, M_Π, t)
    Input: history h, traverser player p, regret network parameters θ for each player, advantage memory M_V for player p, strategy memory M_Π, CFR iteration t.
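A compact sketch of this outer loop in Python, assuming the ReservoirMemory sketch above and caller-supplied functions traverse, train_value_net, and train_policy_net (all hypothetical names; the memory capacity is illustrative):

```python
def deep_cfr(root, traverse, train_value_net, train_policy_net,
             T, K, mem_capacity=40_000_000, num_players=2):
    """Outer loop of Algorithm 1 (sketch).
    traverse(h, p, value_nets, adv_mem, strat_mem, t) runs one external-sampling
    traversal; train_value_net / train_policy_net fit fresh networks to a memory."""
    value_nets = [None] * num_players  # None stands for a zero-initialized V
    adv_mems = [ReservoirMemory(mem_capacity) for _ in range(num_players)]
    strat_mem = ReservoirMemory(mem_capacity)

    for t in range(1, T + 1):
        for p in range(num_players):
            for _ in range(K):
                traverse(root, p, value_nets, adv_mems[p], strat_mem, t)
            # Re-initialize and train V(., .|theta_p) from scratch on M_{V,p}.
            value_nets[p] = train_value_net(adv_mems[p])

    # Pi is trained once at the end, on the full strategy memory.
    return train_policy_net(strat_mem)
```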
28. Algorithm
(Same Algorithm 1/2 figure as slide 27; highlighted step: the TRAVERSE call that collects data.)
Traverse the game tree to collect data
29. Algorithm
(Same figure; highlighted step: training the advantage network V from scratch on M_{V,p}.)
Learning V(I, a|θ_p)
30. Algorithm
(Same figure; highlighted step: training the strategy network Π on M_Π.)
Learning Π(I, a|θ_Π)
31. Algorithm

Algorithm 2 CFR Traversal with External Sampling
function TRAVERSE(h, p, θ_1, θ_2, M_V, M_Π, t)
    Input: history h, traverser player p, regret network parameters θ for each player, advantage memory M_V for player p, strategy memory M_Π, CFR iteration t.
    if h is terminal then
        return the payoff to player p
    else if h is a chance node then
        a ∼ σ(h)
        return TRAVERSE(h·a, p, θ_1, θ_2, M_V, M_Π, t)
    else if P(h) = p then    ▷ If it's the traverser's turn to act
        Compute strategy σ^t(I) from predicted advantages V(I(h), a|θ_p) using regret matching.
        for a ∈ A(h) do
            v(a) ← TRAVERSE(h·a, p, θ_1, θ_2, M_V, M_Π, t)    ▷ Traverse each action
        for a ∈ A(h) do
            r̃(I, a) ← v(a) − Σ_{a′∈A(h)} σ(I, a′) · v(a′)    ▷ Compute advantages
        Insert the infoset and its action advantages (I, t, r̃^t(I)) into the advantage memory M_V
    else    ▷ If it's the opponent's turn to act
        Compute strategy σ^t(I) from predicted advantages V(I(h), a|θ_{3−p}) using regret matching.
        Insert the infoset and its action probabilities (I, t, σ^t(I)) into the strategy memory M_Π
        Sample an action a from the probability distribution σ^t(I).
        return TRAVERSE(h·a, p, θ_1, θ_2, M_V, M_Π, t)
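A Python sketch of this traversal, assuming a hypothetical game-state interface (is_terminal, payoff, is_chance, child, infoset, legal_actions, current_player, sample_chance_action), a value_nets[p].predict method returning {action: advantage}, and the regret_matching helper sketched on slide 32 below:

```python
import random

def traverse(h, p, value_nets, adv_mem, strat_mem, t):
    """One external-sampling traversal (Algorithm 2, sketch).
    h is a hypothetical game-state object; returns the sampled value for player p."""
    if h.is_terminal():
        return h.payoff(p)
    if h.is_chance():
        return traverse(h.child(h.sample_chance_action()), p,
                        value_nets, adv_mem, strat_mem, t)

    I, actions = h.infoset(), h.legal_actions()
    if h.current_player() == p:
        # Traverser's turn: explore every action.
        sigma = regret_matching(value_nets[p].predict(I, actions))
        v = {a: traverse(h.child(a), p, value_nets, adv_mem, strat_mem, t)
             for a in actions}
        ev = sum(sigma[a] * v[a] for a in actions)
        advantages = {a: v[a] - ev for a in actions}  # sampled regrets r~(I, a)
        adv_mem.add((I, t, advantages))
        return ev
    else:
        # Opponent's turn: record the strategy, then sample a single action.
        sigma = regret_matching(value_nets[1 - p].predict(I, actions))
        strat_mem.add((I, t, sigma))
        a = random.choices(actions, weights=[sigma[a] for a in actions])[0]
        return traverse(h.child(a), p, value_nets, adv_mem, strat_mem, t)
```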
32. Algorithm
(Same traversal figure; highlighted step: computing σ^t(I) by regret matching on the predicted advantages.)
Compute the strategy σ^t(I) from V(I, a|θ_p)
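Regret matching turns predicted advantages into a strategy: each action is played in proportion to its positive advantage, and when no advantage is positive the highest-advantage action is played deterministically (the ablation on slide 43 shows a uniform fallback is worse). A sketch; the function name and dict-based interface are mine:

```python
def regret_matching(advantages):
    """Turn predicted advantages {action: value} into a strategy {action: prob}:
    probabilities proportional to positive advantages; if none are positive,
    play the highest-advantage action deterministically."""
    positive = {a: max(v, 0.0) for a, v in advantages.items()}
    total = sum(positive.values())
    if total > 0:
        return {a: v / total for a, v in positive.items()}
    best = max(advantages, key=advantages.get)
    return {a: (1.0 if a == best else 0.0) for a in advantages}
```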
33. Algorithm
(Same traversal figure; highlighted step: computing the sampled advantages and inserting (I, t, r̃^t(I)) into M_V.)
Compute and collect the regrets
34. Algorithm
(Same traversal figure; highlighted step: inserting the opponent's strategy (I, t, σ^t(I)) into M_Π.)
Compute and collect the strategies
40. Network Architecture
■ The same architecture is used for both V and Π
■ Card ranks, suits (1-4), and card identities (1-52) are embedded and fed as input
Figure 1. The neural network architecture used for Deep CFR. The network takes an infoset (observed cards and bet history) as input and outputs values (advantages or probability logits) for each possible action.
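A simplified PyTorch sketch of this encoder; the embedding dimension, MLP depth, and the omission of the bet-history branch are simplifications of the paper's Figure 1 architecture:

```python
import torch
import torch.nn as nn

class CardEmbeddingNet(nn.Module):
    """Sketch of the card encoder: rank, suit, and card-identity embeddings are
    summed per card, pooled over the observed cards, then fed to an MLP that
    outputs one value (advantage or policy logit) per action."""

    def __init__(self, dim=64, num_actions=4):
        super().__init__()
        self.rank = nn.Embedding(13, dim)  # 13 ranks
        self.suit = nn.Embedding(4, dim)   # 4 suits (1-4)
        self.card = nn.Embedding(52, dim)  # 52 card identities (1-52)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
            nn.Linear(dim, num_actions),
        )

    def forward(self, cards):
        # cards: (batch, n_cards) integer card ids in [0, 52)
        x = self.rank(cards // 4) + self.suit(cards % 4) + self.card(cards)
        return self.mlp(x.sum(dim=1))  # (batch, num_actions)
```

For example, net = CardEmbeddingNet(); net(torch.randint(0, 52, (8, 5))) maps a batch of eight 5-card observations to per-action values.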
41. Results : FHP
(Excerpt from the paper:) NFSP, the prior leading method for learning approximate Nash equilibria in imperfect-information games, approximates fictitious play, which is proven to converge but in practice does so far slower than CFR. In FHP, Deep CFR reaches a lower exploitability than NFSP, which converges to 47 mbb/g, and is more sample efficient; since both algorithms spend most of their wallclock time on SGD steps, the speedup over NFSP in wallclock time is smaller. For comparison, the professional human competitors lost to the 2007 Polaris AI by about 52 ± 10 mbb/g (Johanson, 2016). In contrast, the Deep CFR agent loses to a 3.3 · 10^8-bucket abstraction by only 11 ± 2 mbb/g and beats NFSP by 43 ± 2 mbb/g.
[Figure 2 plot: x-axis Nodes Touched (10^6 to 10^11), y-axis Exploitability (mbb/g); title "Convergence of Deep CFR, NFSP, and Domain-Specific Abstractions"; curves: Deep CFR, NFSP (1,000 infosets / update), NFSP (10,000 infosets / update), Abstraction (40,000 / 368,000 / 3,644,000 clusters), Lossless Abstraction (2,340 clusters).]
Figure 2. Comparison of Deep CFR with domain-specific tabular abstractions and NFSP in FHP. Coarser abstractions converge faster but are more exploitable. Deep CFR converges with 2-3 orders of magnitude less data.
■ Deep CFR converges faster than NFSP
■ Deep CFR spends more of its wallclock time on SGD steps, so in total wallclock time it is not dramatically faster than NFSP
42. Results : FHP
■ Reducing the number of game-tree traversals per iteration slows convergence
■ The number of SGD steps per iteration affects the exploitability value that is converged to
■ Increasing the number of hidden-layer units reduces exploitability
[Figure 3 plots: exploitability (mbb/g) vs CFR iteration (left, middle) and vs number of model parameters (right). Left legend: traversals per iteration 3,000 / 10,000 / 30,000 / 100,000 / 300,000 / 1,000,000, plus Linear CFR. Middle legend: SGD steps per iteration 1,000 / 2,000 / 4,000 / 8,000 / 16,000 / 32,000, plus Linear CFR. Right legend: hidden-layer dimension 8 / 16 / 32 / 64 / 128 / 256.]
Figure 3. Left: FHP convergence for different numbers of training data collection traversals per simulated LCFR iteration. The dotted line shows the performance of vanilla tabular Linear CFR without abstraction or sampling. Middle: FHP convergence using different numbers of minibatch SGD updates to train the advantage model at each LCFR iteration. Right: Exploitability of Deep CFR in FHP for different model sizes. Label indicates the dimension (number of features) in each hidden layer of the model.
43. Results : FHP
[Figure 4 plots: exploitability (mbb/g) vs CFR iteration. Left legend: Deep CFR (5 replicates), Deep CFR without Linear Weighting, Deep CFR without Retraining from Scratch, Deep CFR Playing Uniform when All Regrets < 0. Right legend: Deep CFR, Deep CFR with Sliding Window Memories.]
Figure 4. Ablations of Deep CFR components in FHP. Left: As a baseline, we plot 5 replicates of Deep CFR, which show similar exploitability curves (standard deviation at t = 450 is 2.25 mbb/g). Deep CFR without linear weighting converges to a similar exploitability, but more slowly. If the same network is fine-tuned at each CFR iteration rather than training from scratch, the final exploitability is about 50% higher. Also, if the algorithm plays a uniform strategy when all regrets are negative (i.e. standard regret matching), rather than the highest-regret action, the final exploitability is also 50% higher. Right: If Deep CFR uses sliding-window memories instead of reservoir sampling, exploitability becomes unstable once the memories are full.
■ Fine-tuning V at each iteration instead of retraining from scratch worsens exploitability
■ Switching the memories from reservoir sampling to a sliding window makes exploitability unstable
44. Results : HULH
■ Deep CFR beats NFSP by a large margin
■ It loses slightly as the abstraction is made finer (less coarse)
Table 1. Head-to-head expected value of NFSP and Deep CFR in HULH against converged CFR equilibria with varying abstraction sizes. For comparison, in 2007 an AI using abstractions of roughly 3 · 10^8 buckets defeated human professionals by about 52 mbb/g (after variance reduction techniques were applied).

Opponent model | NFSP          | Deep CFR      | Abstraction 3.3·10^6 | Abstraction 3.3·10^7 | Abstraction 3.3·10^8
NFSP           | -             | -43 ± 2 mbb/g | -40 ± 2 mbb/g        | -49 ± 2 mbb/g        | -55 ± 2 mbb/g
Deep CFR       | +43 ± 2 mbb/g | -             | +6 ± 2 mbb/g         | -6 ± 2 mbb/g         | -11 ± 2 mbb/g