Deep Counterfactual Regret Minimization

DEEP COUNTERFACTUAL REGRET
MINIMIZATION
AI Lab 阿部拳之
2019/09/03

Summary
■ Noam Brown, Adam Lerer, Sam Gross, Tuomas Sandholm
■ ICML 2019
■ Contribution
– DNNをCFRに組み込んだアルゴリズム Deep CFRを提案
– 巨⼤なゲームを扱うためのAbstractionの必要性を排除

Notation
■ ℎ : 全プレイヤの⾏動履歴 or ノード
■ 𝐼 : 情報集合
■ 𝐴(ℎ) : ℎで取ることができる⾏動の集合
■ 𝑃(ℎ) : ℎで⾏動を取るプレイヤ
■ 𝑍 : 終端ノードの集合
■ 𝑢) ∶ 𝑍 → 𝑅 : 終端ノードにおける利得関数
■ 𝑍- : 𝐼から到達することができる終端ノードの集合
■ 𝑧 𝐼 : 𝑧 ∈ 𝑍へと到達することができる𝐼中のノード

Notation
■ 𝜎(𝐼) : 𝐼における戦略
■ 𝜎)(𝐼) : プレイヤ𝑝の戦略
■ 𝜎2)(𝐼) : 𝑝以外のプレイヤの戦略
■ 𝜋4(ℎ) : 𝜎に従ったときにℎに到達する確率
■ 𝜋)
4(ℎ) : プレイヤ𝑝のcontribution
■ 𝜋2)
4 (ℎ) : 𝑝以外のプレイヤのcontribution

⼆⼈零和不完全情報ゲームにおける
最適な戦略
■ ナッシュ均衡
– どのプレイヤも⾃分の戦略を変更することによって，より⾼い利得を得るこ
とができないような戦略の組み合わせ
■ 直感的には，それぞれのプレイヤが⾃分だけ戦略を変えることのメリッ
トがない状況
– 各プレイヤ𝑝の戦略を𝜎)，利得を𝑢)とすると，ナッシュ均衡戦略の組み合わ
せ𝜎∗
は以下の式を満たす
∀𝑝, 𝑢) 𝜎)
∗, 𝜎2)
∗ = max
4<
= ∈><
𝑢)(𝜎)
? , 𝜎2)
∗ )
■ ⼆⼈零和不完全情報ゲームでは，ナッシュ均衡戦略を最適戦略とすることが多い
𝑢@ 𝜎@
∗
, 𝜎A
∗
= max
4B
=∈>B
𝑢@(𝜎@
?
, 𝜎A
∗
) , 𝑢A 𝜎@
∗
, 𝜎A
∗
= max
4C
=∈>C
𝑢A(𝜎@
∗
, 𝜎A
?
)

ナッシュ均衡戦略の計算
■ ゲームの状態数が⼤きい場合，厳密なナッシュ均衡戦略の計算は難しい．．．
– ポーカーで⾔うと・・・
– Heads up Limit Texas Hold’em : 約10@F
– Heads up No-limit Texas Hold’em : 約10@GH
■ できる限りナッシュ均衡戦略に近い戦略を求める
– 代表的なアルゴリズム︓Counterfactual Regret Minimization (CFR)

Counterfactual Regret Minimization
■ 基本的な流れ
– 「この⼿を取ったほうが良かった」という後悔 (regret) を元に戦略を更新す
ることで，（平均）戦略をナッシュ均衡へ近づけていく
■ Counterfactual Regret
𝑟)
J
𝐼, 𝑎 = 𝜋2)
4L
𝐼 (𝑢) 𝜎-→M
J
, 𝐼 − 𝑢) 𝜎J
, 𝐼 )
– 「現在の戦略𝜎Jを情報集合𝐼において⾏動𝑎を取るように変更したら，どれ
だけ利得が増えるのか」を定式化
A
B B B
グーチョキ
パー情報集合𝐼
Regret Minimization in Games with Incomplete Information (2008)
https://papers.nips.cc/paper/3306-regret-minimization-in-games-with-incomplete-information

■ 基本的な流れ
– 「この⼿を取ったほうが良かった」という後悔 (regret) を元に戦略を更新す
ることで，（平均）戦略をナッシュ均衡へ近づけていく
■ Counterfactual Regret
𝑟)
J
𝐼, 𝑎 = 𝜋2)
4L
𝐼 (𝑢) 𝜎-→M
J
, 𝐼 − 𝑢) 𝜎J
, 𝐼 )
– 「現在の戦略𝜎Jを情報集合𝐼において⾏動𝑎を取るように変更したら，どれ
だけ利得が増えるのか」を定式化
A
B B B
グーチョキ
パー
グーを出すようにしたら，
どれくらい得かな．．．

■ Cumulative Counterfactual Regret
𝑅)
O
𝐼, 𝑎 = P
JQ@
O
𝑟)
J
(𝐼, 𝑎)
■ Cumulative Counterfactual Regretに基づいて，戦略を更新
𝜎)
OR@ 𝐼, 𝑎 =
max(𝑅)
O
𝐼, 𝑎 , 0)
∑M∈T(-) max(𝑅)
O
𝐼, 𝑎 , 0)
, 𝑖𝑓 P
M∈T(-)
max(𝑅)
O
𝐼, 𝑎 , 0) > 0
1
|𝐴 𝐼 |
, 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒
■ 𝑡 = 1, ⋯ , 𝑇までの戦略𝜎)
Jを平均化した戦略がナッシュ均衡戦略へと収束
– ただし，⼆⼈零和不完全情報ゲームのみ保証
– 三⼈以上では収束性の証明はされていない

■ アルゴリズム概要
1. ある情報集合𝐼に対して，以下の２つの価値を計算
今の戦略の期待利得︓ 𝑢) 𝜎J
, 𝐼
”𝐼において⾏動𝑎を取るように変えた戦略”の期待
利得︓𝑢) 𝜎-→M
J
, 𝐼
2. 𝑢) 𝜎J, 𝐼 ， 𝑢) 𝜎-→M
J
, 𝐼 を元に，regretを計算
3. Regretを元に，戦略を更新
A
B B B
今の戦略の期待利得 : 𝑢) 𝜎J
, 𝐼
𝑎 𝑏 𝑐

, 𝐼
J
, 𝐼
2. 𝑢) 𝜎J, 𝐼 ， 𝑢) 𝜎-→M
J
A
B B B
, 𝐼
𝑎 𝑏 𝑐
𝑅)
O
𝐼, 𝑎 = 2 𝑅)
O
𝐼, 𝑏 = 0 𝑅)
O
𝐼, 𝑐 = 1

, 𝐼
J
, 𝐼
2. 𝑢) 𝜎J, 𝐼 ， 𝑢) 𝜎-→M
J
A
B B B
, 𝐼
𝑎 𝑏 𝑐
𝑅)
O
𝐼, 𝑎 = 2 𝑅)
O
𝐼, 𝑏 = 0 𝑅)
O
𝐼, 𝑐 = 1
𝜎)
OR@(𝐼)を(2/3, 0, 1/3)に更新

■ さらに⾼速化したアルゴリズムCFR+によって，10@F
程度の情報集合数を持つ
ゲームにおいて収束できるように
■ Heads up Limit Texas Hold’em（情報集合数︓約10@F）はほぼ攻略︕︕
■ しかしHeads up No-limit Texas Hold’em（情報集合数︓約10@GH）はまだまだ遠
い．．．
→戦略ベクトルをPCに保持できるような容量ですらないので，そもそも戦略を
求めようとすることが現実的ではない
→ゲームを抽象化 (abstraction) して，情報集合数を現実的なサイズまで減らす
Solving Heads-up Limit Texas Hold’em (2015)
https://poker.cs.ualberta.ca/publications/2015-ijcai-cfrplus.pdf

Abstraction
■ ゲームを抽象化することによって情報集合数を削減
■ Domainに関する知識を使って設計するのが基本
■ ポーカーにおけるabstractionの例
– ベッドできる⾦額を制限する（action abstraction）
– 同じような強さの⼿札は同⼀のグループとして扱う（information
abstraction）
1. Abstractionを⽤いてゲームを抽象化
2. CFRなどで抽象化したゲームのナッシュ均衡戦略を計算
3. 実際のプレイ時に抽象化ゲームのナッシュ均衡戦略を⽤いる
というのが主流に

Abstraction
■ Domainに関する知識をかなり要する
■ Abstractionをミスると元のゲームのナッシュ均衡戦略と⼤きく離れる場合も
→かなり設計が⾯倒
→できる限り設計したくない︕︕
→どうする︖︖
■ Deep Learningに頼ろう︕︕
→今回発表するDeep CFRの登場

MONTE CARLO
COUNTERFACTUAL
REGRET MINIMIZATION

Monte Carlo counterfactual regret
minimization
■ Iterationごとにゲーム⽊の⼀部分𝑄のみを探索
– 従来のCFR : Iterationごとにゲーム⽊を全探索
→ゲーム⽊が⼤きい場合，実⾏が不可能
■ Iterationごとに探索するノードのサンプリング⽅法
– External sampling
■ Iterationごとに⼀⼈の⾏動のみ全探索
■ それ以外のプレイヤの⾏動は戦略𝜎からサンプリング
→Deep CFRでも使⽤︕︕
– Outcome sampling
■ Iterationごとに全プレイヤの⾏動を戦略𝜎?
からサンプリング
C
𝑃@
𝑃A 𝑃A
𝑃@ C
100 25

External Sampling
■ 全探索しないため，正確なregret 𝑟J(𝐼, 𝑎)はわからない
■ そこで，サンプリングした結果を使ってregretをモンテカルロ近似
̃𝑟)
J 𝐼, 𝑎 = (1 − 𝜎J 𝑎 𝐼 ) P
g∈h∩jk
𝑢) 𝑧 𝜋)
4L
(𝑧 𝐼 𝑎, 𝑧)
– 期待値的に𝑟J(𝐼, 𝑎)と⼀致

External Sampling MCCFR
1. 戦略を更新するプレイヤ (traverser) を𝑝とする
2. ゲーム⽊を探索
a. これまでのiterationのregretから，戦略𝜎Jを計算
b. プレイヤが𝑝であれば，各⾏動𝑎に対してregret
̃𝑟)
J 𝐼, 𝑎 を計算
c. それ以外であれば，⾏動を𝜎J(𝐼)からサンプル
3. Step 1とStep 2を各プレイヤを𝑝として繰り返す
4. これまでのiterationの戦略𝜎Jから，平均戦略 l𝜎O(𝐼)
を計算
C
𝑃@
𝑃A 𝑃A
𝑃@ C
100 2550 120
Regret=
-50
Regret=
20
Regret =
75

DEEP
COUNTERFACTUAL
REGRET MINIMIZATION

Deep CFR
■ CFRの挙動をDNNを使って近似
■ CFRにおける２つの値をDNNで近似
– Regret 𝑅)
J (𝐼, 𝑎)を𝑉(𝐼, 𝑎|𝜃))で近似
– 平均戦略 l𝜎O(𝐼)をΠ(𝐼, 𝑎|𝜃p)で近似
■ External Sampling MCCFRのアルゴリズムにDNNを組み込んだイメージ

External Sampling MCCFRとの差分
1. 戦略を更新する対象のプレイヤ (traverser) を𝑝とする
2. ゲーム⽊を探索
a. これまでのiterationのregretから，戦略𝜎Jを計算
b. プレイヤが𝑝であれば，各⾏動𝑎に対してregret ̃𝑟 𝐼, 𝑎 を計
算
c. それ以外のプレイヤであれば，⾏動を𝜎J(𝐼)からサンプル
3. Step 1とStep 2を各プレイヤを対象にして繰り返す
4. これまでのiterationの戦略𝜎Jから，平均戦略 l𝜎O(𝐼)を計算
DNNで近似︕︕

𝑉(𝐼, 𝑎|𝜃))の学習
■ Regret ̃𝑟)
J 𝐼, 𝑎 を近似
■ 戦略𝜎J(𝐼)を決定するために使⽤
– 𝑉の出⼒に対するRegret matchingで決定

■ Loss function
𝐸 -,J=, ̃rL=
~tu,<
[𝑡? P
M
̃𝑟J=
𝑎 − 𝑉 𝐼, 𝑎 𝜃)
A
]
– 𝑀y,) : プレイヤ𝑝のregretを保存するメモリ
■ capacityをオーバーしたらreservoir samplingで上書き
– Iteration毎に重みをランダム初期化して⼀から学習
※これまでのiterationのregret ̃𝑟J=
(𝐼, 𝑎)の平均はtotal regret z𝑅J(𝐼, 𝑎)に⽐
例→古いデータも残しておく必要があるので，reservoir sampling

Π(𝐼, 𝑎|𝜃p)の学習
■ 平均戦略 l𝜎O(𝐼)を近似
※CFRでは平均戦略がナッシュ均衡戦略へと収束
■ Loss function
𝐸 -,J=,4L=
~t{
[𝑡? P
M
𝜎J=
𝑎 − Π 𝐼, 𝑎 𝜃p
A
]
– 𝑀p : 各プレイヤの戦略を保存するメモリ
■ capacityをオーバーしたらreservoir samplingで上書き

Algorithm
Deep Counterfactual Regret Minimization
Algorithm 1 Deep Counterfactual Regret Minimization
function DEEPCFR
Initialize each player’s advantage network V (I, a|✓p) with parameters ✓p so that it returns 0 for all inputs.
Initialize reservoir-sampled advantage memories MV,1, MV,2 and strategy memory M⇧.
for CFR iteration t = 1 to T do
for each player p do
for traversal k = 1 to K do
TRAVERSE(;, p, ✓1, ✓2, MV,p, M⇧) . Collect data from a game traversal with external sampling
Train ✓p from scratch on loss L(✓p) = E(I,t0,˜rt0
)⇠MV,p

t0
P
a
⇣
˜rt0
(a) V (I, a|✓p)
⌘2
Train ✓⇧ on loss L(✓⇧) = E(I,t0, t0
)⇠M⇧

t0
P
a
⇣
t0
(a) ⇧(I, a|✓⇧)
⌘2
return ✓⇧
Algorithm 2 CFR Traversal with External Sampling
function TRAVERSE(h, p, ✓1, ✓2, MV , M⇧, t)
Input: History h, traverser player p, regret network parameters ✓ for each player, advantage memory MV for player
p, strategy memory M⇧, CFR iteration t.

Algorithm
function DEEPCFR
)⇠MV,p

t0
P
a
⇣
˜rt0
(a) V (I, a|✓p)
⌘2
)⇠M⇧

t0
P
a
⇣
t0
(a) ⇧(I, a|✓⇧)
⌘2
return ✓⇧
ゲーム⽊を探索してデータ収集

function DEEPCFR
)⇠MV,p

t0
P
a
⇣
˜rt0
(a) V (I, a|✓p)
⌘2
)⇠M⇧

t0
P
a
⇣
t0
(a) ⇧(I, a|✓⇧)
⌘2
return ✓⇧
Algorithm

function DEEPCFR
)⇠MV,p

t0
P
a
⇣
˜rt0
(a) V (I, a|✓p)
⌘2
)⇠M⇧

t0
P
a
⇣
t0
(a) ⇧(I, a|✓⇧)
⌘2
return ✓⇧
Algorithm
Π(𝐼, 𝑎|𝜃p)の学習

)⇠MV,p

t0
P
a
⇣
˜rt0
(a) V (I, a|✓p)
⌘2
)⇠M⇧

t0
P
a
⇣
t0
(a) ⇧(I, a|✓⇧)
⌘2
return ✓⇧
if h is terminal then
return the payoff to player p
else if h is a chance node then
a ⇠ (h)
return TRAVERSE(h · a, p, ✓1, ✓2, MV , M⇧, t)
else if P(h) = p then . If it’s the traverser’s turn to act
Compute strategy t
(I) from predicted advantages V (I(h), a|✓p) using regret matching.
for a 2 A(h) do
v(a) TRAVERSE(h · a, p, ✓1, ✓2, MV , M⇧, t) . Traverse each action
for a 2 A(h) do
˜r(I, a) v(a)
P
a02A(h) (I, a0
) · v(a0
) . Compute advantages
Insert the infoset and its action advantages (I, t, ˜rt
(I)) into the advantage memory MV
else . If it’s the opponent’s turn to act
Compute strategy t
(I) from predicted advantages V (I(h), a|✓3 p) using regret matching.
Insert the infoset and its action probabilities (I, t, t
(I)) into the strategy memory M⇧
Sample an action a from the probability distribution t
(I).
Algorithm

Algorithm
)⇠MV,p

t0
P
a
⇣
˜rt0
(a) V (I, a|✓p)
⌘2
)⇠M⇧

t0
P
a
⇣
t0
(a) ⇧(I, a|✓⇧)
⌘2
return ✓⇧
a ⇠ (h)
Compute strategy t
for a 2 A(h) do
for a 2 A(h) do
˜r(I, a) v(a)
P
a02A(h) (I, a0
) · v(a0
Compute strategy t
(I).
𝑉(𝐼, 𝑎|𝜃))から戦略𝜎J(𝐼)を計算

)⇠MV,p

t0
P
a
⇣
˜rt0
(a) V (I, a|✓p)
⌘2
)⇠M⇧

t0
P
a
⇣
t0
(a) ⇧(I, a|✓⇧)
⌘2
return ✓⇧
a ⇠ (h)
Compute strategy t
for a 2 A(h) do
for a 2 A(h) do
˜r(I, a) v(a)
P
a02A(h) (I, a0
) · v(a0
Compute strategy t
(I).
Algorithm
Regretを計算して収集

)⇠MV,p

t0
P
a
⇣
˜rt0
(a) V (I, a|✓p)
⌘2
)⇠M⇧

t0
P
a
⇣
t0
(a) ⇧(I, a|✓⇧)
⌘2
return ✓⇧
a ⇠ (h)
Compute strategy t
for a 2 A(h) do
for a 2 A(h) do
˜r(I, a) v(a)
P
a02A(h) (I, a0
) · v(a0
Compute strategy t
(I).
Algorithm
戦略を計算して収集

命題
■ 𝑇をDeep CFRのIteration数，𝐴を任意の情報集合での⾏動数の最⼤
値，𝐾をiterationごとにゲーム⽊を探索する回数とする．
メモリ𝑀y,)が⼗分に⼤きいとき，regret 𝑅)
Oは確率1 − 𝜌で以下の式
を満たす．
𝑅)
O ≤ 1 +
2
𝜌𝐾
Δ 𝐿) 𝐴 𝑇 + 4𝑇 𝐿) 𝐴 Δ𝜖„
（ただし，𝐿y
J
を𝑉(𝐼, 𝑎|𝜃)
J)のMSE lossとし， 𝐿y∗
J
を達成できるMSE lossの
最⼩値としたとき，𝜖„は𝐿y
J
− 𝐿y∗
J
≤ 𝜖„を満たす）
→ 𝑇 → ∞のとき，
†<
‡
O
が4 𝐿) 𝐴 Δ𝜖„で抑えられる

Setting
■ Heads-up flop hold’em (FHP)
– ノード数10@A，情報集合数10ˆ
– Head-up limit Texas hold’emの2 Roundで終了するversion
– コミュニティカードは3枚
■ Head-up limit Texas hold’em (HULH)
– ノード数10@‰，情報集合数10@F
■ Baselines
– Neural Fictitious Self Play (NFSP)
– Abstraction + CFR
■ Abstractionの粗さを⾊々変更
■ Lossless : 戦略的に同形の状況をまとめるAbstraction

Network Architecture
■ 𝑉でもΠでも同じarchitectureを使⽤
■ カードの数字やマークをembeddingして⼊⼒
✓t
) on a sample in MV,p
minimum loss achievable
⇤  ✏L.
ly large, then with proba-
s bounded by
p
T + 4T|Ip|
p
|A| ✏L
(5)
regret
RT
p
T is bounded by
✏L
Figure 1. The neural network architecture used for Deep CFR.
The network takes an infoset (observed cards and bet history) as
input and outputs values (advantages or probability logits) for each
possible action.
(1-4), and a card embedding (1-52). These embeddings

Results : FHP
r learning approximate Nash
mation games. NFSP approx-
hich is proven to converge to
ctice does so far slower than
CFR reaches an exploitability
verges to 47 mbb/g.4
We also
re sample efﬁcient than NFSP.
d most of their wallclock time
our implementation we see a
ver NFSP in wallclock time
ce of Deep CFR using differ-
als, network SGD steps, and
f CFR traversals per iteration
mes slower but the model con-
oitability. This is presumably
ons to collect enough data to
ly. On the other hand, reduc-
s does not change the rate of
ymptotic exploitability of the
e model architecture as we use
k game of Leduc Hold’em, our
s an average exploitability (total
the professional human competitors lost to the 2007 Polaris
AI by about 52 ± 10 mbb/g (Johanson, 2016). In contrast,
our Deep CFR agent loses to a 3.3 · 108
bucket abstraction
by only 11 ± 2 mbb/g and beats NFSP by 43 ± 2 mbb/g.
10
6
10
7
10
8
10
9
10
10
10
11
1Rdes 7RuFhed
10
2
10
3
(xSlRLtDELlLty(PEE/g)
CRnvergenFe Rf DeeS C)R, 1)63, Dnd DRPDLn-6SeFLfLF AEstrDFtLRns
DeeS C)R
1)63 (1,000 LnfRsets / uSdDte)
1)63 (10,000 LnfRsets / uSdDte)
AEstrDFtLRn (40,000 Clusters)
AEstrDFtLRn (368,000 Clusters)
AEstrDFtLRn (3,644,000 Clusters)
/Rssless AEstrDFtLRn (2340 Clusters)
Figure 2. Comparison of Deep CFR with domain-speciﬁc tabular
abstractions and NFSP in FHP. Coarser abstractions converge faster
but are more exploitable. Deep CFR converges with 2-3 orders of
n Deep CFRはNFSPよりも早く収束
n Deep CFRのほうがSGD stepsに多くの実⾏時間を費やすため，
総実⾏時間で⾒るとNFSPに⽐べて劇的に早いわけではない

Results : FHP
n Iterationごとのゲーム⽊探索回数を減らすと収束が遅くなる
n IterationごとのSGD steps数はexploitabilityの収束値に影響を与える
n 中間層のユニット数を増やすとexploitabilityが減少する
Abstraction Size
Model NFSP Deep CFR 3.3 · 106
3.3 · 107
3.3 · 108
NFSP - 43 ± 2 mbb/g 40 ± 2 mbb/g 49 ± 2 mbb/g 55 ± 2 mbb/g
Deep CFR +43 ± 2 mbb/g - +6 ± 2 mbb/g 6 ± 2 mbb/g 11 ± 2 mbb/g
Table 1. Head-to-head expected value of NFSP and Deep CFR in HULH against converged CFR equilibria with varying abstraction sizes.
For comparison, in 2007 an AI using abstractions of roughly 3 · 108
buckets defeated human professionals by about 52 mbb/g (after
variance reduction techniques were applied).
10
1
10
2
C)R ,teratLRn
10
2
10
3
(xplRLtaELlLty(mEE/g)
Traversals per Lter
3,000
10,000
30,000
100,000
300,000
1,000,000
/Lnear C)R
Figure 3. Left: FHP convergence for different numbers of training data collection traversals per simulated LCFR iteration. The dotted
line shows the performance of vanilla tabular Linear CFR without abstraction or sampling. Middle: FHP convergence using different
numbers of minibatch SGD updates to train the advantage model at each LCFR iteration. Right: Exploitability of Deep CFR in FHP for
different model sizes. Label indicates the dimension (number of features) in each hidden layer of the model.
Abstraction Size
3.3 · 107
3.3 · 108
Traversals per Lter
10
1
10
2
C)R ,terDtLRn
10
2
10
3
(xSlRLtDELlLty(mEE/g)
6GD steSs Ser Lter
1,000
2,000
4,000
8,000
16,000
32,000
/LneDr C)R
Abstraction Size
3.3 · 107
3.3 · 108
6GD steSs Ser Lter
10
4
10
5
10
6
# 0odel 3araPeters
10
2
(xploitaEility(PEE/g)
diP 8
diP 16
diP 32
diP 64
diP 128 diP 256

Results : FHP
Figure 3. Left: FHP convergence for different numbers of training data collection traversals per simulated LCFR
ine shows the performance of vanilla tabular Linear CFR without abstraction or sampling. Middle: FHP converg
numbers of minibatch SGD updates to train the advantage model at each LCFR iteration. Right: Exploitability of D
10
1
10
2
C)5 IWeUDWLRn
10
2
10
3
(xSlRLWDELlLWy(PEE/g)
DeeS C)5 (5 UeSlLFDWes)
DeeS C)5 wLWhRuW /LneDU WeLghWLng
DeeS C)5 wLWhRuW 5eWUDLnLng IURP SFUDWFh
DeeS C)5 3lDyLng UnLIRUP when All 5egUeWs < 0
Figure 4. Ablations of Deep CFR components in FHP. Left: As a baseline, we plot 5 replicates of Deep CFR, wh
exploitability curves (standard deviation at t = 450 is 2.25 mbb/g). Deep CFR without linear weighting co
exploitability, but more slowly. If the same network is fine-tuned at each CFR iteration rather than training fro
exploitability is about 50% higher. Also, if the algorithm plays a uniform strategy when all regrets are negative
matching), rather than the highest-regret action, the final exploitability is also 50% higher. Right: If Deep CFR
6
shows the performance of vanilla tabular Linear CFR without abstraction or sampling. Middle: FHP convergen
mbers of minibatch SGD updates to train the advantage model at each LCFR iteration. Right: Exploitability of Dee
erent model sizes. Label indicates the dimension (number of features) in each hidden layer of the model.
10
1
10
2
C)R IWerDWiRn
10
2
10
3
(xSlRiWDEiliWy(mEE/g)
DeeS C)R
DeeS C)R wiWh Sliding WindRw 0emRries
ure 4. Ablations of Deep CFR components in FHP. Left: As a baseline, we plot 5 replicates of Deep CFR, which
loitability curves (standard deviation at t = 450 is 2.25 mbb/g). Deep CFR without linear weighting conve
loitability, but more slowly. If the same network is fine-tuned at each CFR iteration rather than training from
loitability is about 50% higher. Also, if the algorithm plays a uniform strategy when all regrets are negative (i.e
ching), rather than the highest-regret action, the final exploitability is also 50% higher. Right: If Deep CFR is
6
n 各iterationで𝑉をfine-tuningするようにするとexploitabilityが悪化
n メモリをreservoir samplingからsliding windowに変えるとexploitabilityが不安定

Results : HULH
n NFSPには⼤きく勝ち越し
n Abstractionの粗さを⼩さく（細かく）するとやや負ける
Opponent Model
Abstraction Size
3.3 · 107
3.3 · 108
able 1. Head-to-head expected value of NFSP and Deep CFR in HULH against converged CFR equilibria with varying abstraction size
or comparison, in 2007 an AI using abstractions of roughly 3 · 108
buckets defeated human professionals by about 52 mbb/g (afte
ariance reduction techniques were applied).

まとめと所感
■ DNNをCFRに組み込んだアルゴリズム Deep CFRを提案
■ 巨⼤なゲームを扱うためのAbstractionの必要性を排除
■ AbstractionをいらなくしたことでCFRをかなり汎⽤的にしてくれた
■ 簡単なゲームで実装・実験してみたい

Deep Counterfactual Regret Minimization

Recommended

Recommended

More Related Content

What's hot

What's hot (20)

Similar to Deep Counterfactual Regret Minimization

Similar to Deep Counterfactual Regret Minimization (20)

Recently uploaded

Recently uploaded (20)

Deep Counterfactual Regret Minimization