DEEP COUNTERFACTUAL REGRET
MINIMIZATION
AI Lab 阿部拳之
2019/09/03
Summary
■ Noam Brown, Adam Lerer, Sam Gross, Tuomas Sandholm
■ ICML 2019
■ Contribution
– Proposes Deep CFR, an algorithm that incorporates deep neural networks (DNNs) into CFR
– Removes the need for abstraction when tackling huge games
INTRODUCTION
Notation
■ h : the full action history of all players, i.e. a node
■ I : an information set
■ A(h) : the set of actions available at h
■ P(h) : the player who acts at h
■ Z : the set of terminal nodes
■ u_p : Z → R : player p's payoff function at terminal nodes
■ Z_I : the set of terminal nodes reachable from I
■ z[I] : the node in I from which z ∈ Z can be reached
Notation
■ σ(I) : the strategy at I
■ σ_p(I) : player p's strategy
■ σ_{-p}(I) : the strategies of the players other than p
■ π^σ(h) : the probability of reaching h when play follows σ
■ π_p^σ(h) : player p's contribution to that probability
■ π_{-p}^σ(h) : the contribution of the players other than p
Optimal strategies in two-player zero-sum imperfect-information games
■ Nash equilibrium
– A combination of strategies in which no player can obtain a higher payoff by changing only their own strategy
■ Intuitively, a situation in which no player gains anything by unilaterally deviating
– Writing player p's strategy as σ_p and payoff as u_p, a Nash-equilibrium strategy profile σ* satisfies
∀p, u_p(σ_p*, σ_{-p}*) = max_{σ_p' ∈ Σ_p} u_p(σ_p', σ_{-p}*)
■ In two-player zero-sum imperfect-information games, the Nash-equilibrium strategy is usually taken as the optimal strategy (a small numerical check follows below)
u_1(σ_1*, σ_2*) = max_{σ_1' ∈ Σ_1} u_1(σ_1', σ_2*) ,  u_2(σ_1*, σ_2*) = max_{σ_2' ∈ Σ_2} u_2(σ_1*, σ_2')
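As a sanity check on the definition, here is a minimal Python sketch (not from the slides) that verifies the Nash condition numerically for rock-paper-scissors, whose equilibrium is the uniform strategy; the payoff matrix and strategies below are the usual textbook values.

import numpy as np

# Payoff matrix for player 1 (rows: player 1's action, columns: player 2's action).
U1 = np.array([[0, -1, 1],
               [1, 0, -1],
               [-1, 1, 0]])
U2 = -U1  # zero-sum game

sigma1 = np.full(3, 1 / 3)  # candidate equilibrium strategies
sigma2 = np.full(3, 1 / 3)

u1 = sigma1 @ U1 @ sigma2              # expected payoff under (sigma1*, sigma2*)
best_response_1 = (U1 @ sigma2).max()  # best player 1 can do against sigma2*
best_response_2 = (sigma1 @ U2).max()  # best player 2 can do against sigma1*

# Both players already achieve their best-response value -> Nash equilibrium.
assert np.isclose(u1, best_response_1) and np.isclose(-u1, best_response_2)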
Computing Nash-equilibrium strategies
■ When the game has a huge number of states, computing an exact Nash-equilibrium strategy is hard...
– In poker terms:
– Heads-up Limit Texas Hold'em : about 10^14 information sets
– Heads-up No-limit Texas Hold'em : about 10^161 information sets
■ Instead, compute a strategy as close to a Nash equilibrium as possible
– Representative algorithm: Counterfactual Regret Minimization (CFR)
Counterfactual Regret Minimization
■ Basic idea
– Update the strategy based on the regret of "I should have taken that action instead", so that the (average) strategy approaches a Nash equilibrium
■ Counterfactual Regret
r_p^t(I, a) = π_{-p}^{σ^t}(I) ( u_p(σ^t_{I→a}, I) − u_p(σ^t, I) )
– Formalizes "how much would my payoff increase if the current strategy σ^t were changed to take action a at information set I" (a small numerical sketch follows after the reference below)
(Game-tree illustration: a node A with the actions rock, scissors, and paper leading to three B nodes that form the information set I)
Regret Minimization in Games with Incomplete Information (2008)
https://papers.nips.cc/paper/3306-regret-minimization-in-games-with-incomplete-information
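To make the formula concrete, a small Python sketch with made-up numbers for the rock-paper-scissors example above; in this one-shot game the opponent reach probability π_{-p}^{σ^t}(I) is 1, so the regret is just a difference of expected values.

import numpy as np

U = np.array([[0, -1, 1],   # our payoff for (our action, opponent action)
              [1, 0, -1],
              [-1, 1, 0]])
sigma_self = np.array([1 / 3, 1 / 3, 1 / 3])  # our current strategy sigma^t(I)
sigma_opp = np.array([0.4, 0.3, 0.3])         # opponent's (hypothetical) strategy

action_values = U @ sigma_opp                 # u_p(sigma^t_{I->a}, I) for each action a
current_value = sigma_self @ action_values    # u_p(sigma^t, I)
regret = 1.0 * (action_values - current_value)  # r_p^t(I, a) with reach probability 1

print(dict(zip(["rock", "scissors", "paper"], regret.round(3))))  # {'rock': 0.0, 'scissors': 0.1, 'paper': -0.1}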
Counterfactual Regret Minimization
(Same slide as above; the game-tree illustration is annotated "If I switched to playing rock here, how much would I gain?")
Counterfactual Regret Minimization
■ Cumulative Counterfactual Regret
R_p^T(I, a) = Σ_{t=1}^{T} r_p^t(I, a)
■ Update the strategy based on the cumulative counterfactual regret (regret matching; a code sketch follows below)
σ_p^{T+1}(I, a) = max(R_p^T(I, a), 0) / Σ_{a'∈A(I)} max(R_p^T(I, a'), 0)   if Σ_{a'∈A(I)} max(R_p^T(I, a'), 0) > 0
σ_p^{T+1}(I, a) = 1 / |A(I)|   otherwise
■ The strategy obtained by averaging σ_p^t over t = 1, ..., T converges to a Nash-equilibrium strategy
– However, this is guaranteed only for two-player zero-sum imperfect-information games
– No convergence proof exists for three or more players
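A minimal Python sketch of the regret-matching update above; the cumulative regrets (2, 0, 1) used in the worked example on the following slides indeed give the strategy (2/3, 0, 1/3).

import numpy as np

def regret_matching(cumulative_regret: np.ndarray) -> np.ndarray:
    """Compute sigma^{T+1}(I, .) from the cumulative counterfactual regrets R^T(I, .)."""
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.full(len(cumulative_regret), 1.0 / len(cumulative_regret))

print(regret_matching(np.array([2.0, 0.0, 1.0])))     # -> [0.667, 0.0, 0.333]
print(regret_matching(np.array([-1.0, -2.0, -3.0])))  # all non-positive -> uniform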
Counterfactual Regret Minimization
■ Algorithm outline
1. For an information set I, compute the following two values:
   the expected payoff of the current strategy: u_p(σ^t, I)
   the expected payoff of "the strategy changed to take action a at I": u_p(σ^t_{I→a}, I)
2. Compute the regret from u_p(σ^t, I) and u_p(σ^t_{I→a}, I)
3. Update the strategy based on the regret
(Game-tree illustration: node A with actions a, b, c leading to three B nodes; the current strategy's expected payoff is u_p(σ^t, I))
Counterfactual Regret Minimization
(Same slide as above; in the illustration the cumulative regrets have been computed as R_p^T(I, a) = 2, R_p^T(I, b) = 0, R_p^T(I, c) = 1)
Counterfactual Regret Minimization
(Same slide as above; with regrets R_p^T(I, a) = 2, R_p^T(I, b) = 0, R_p^T(I, c) = 1, the strategy σ_p^{T+1}(I) is updated to (2/3, 0, 1/3))
Counterfactual Regret Minimization
■ With the faster CFR+ algorithm, games with on the order of 10^14 information sets can now be converged
■ Heads-up Limit Texas Hold'em (about 10^14 information sets) is essentially solved!!
■ But Heads-up No-limit Texas Hold'em (about 10^161 information sets) is still far out of reach...
→ The strategy vector does not even fit in a machine's memory, so trying to compute the strategy directly is not realistic
→ Abstract the game so that the number of information sets shrinks to a tractable size
Solving Heads-up Limit Texas Hold'em (2015)
https://poker.cs.ualberta.ca/publications/2015-ijcai-cfrplus.pdf
Abstraction
■ Reduce the number of information sets by abstracting the game
■ Typically designed using domain knowledge
■ Examples of abstraction in poker (an illustrative bucketing sketch follows below):
– Restrict the amounts that can be bet (action abstraction)
– Treat hands of similar strength as the same group (information abstraction)
The mainstream recipe has become:
1. Abstract the game using an abstraction
2. Compute a Nash-equilibrium strategy of the abstracted game with CFR or similar
3. Use the abstracted game's Nash-equilibrium strategy when actually playing
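As an illustration only (not the scheme used in the paper), an information abstraction can be as simple as bucketing hands by an estimated equity value; hand_equities and the bucket count below are made-up placeholders.

import numpy as np

def bucket_hands(hand_equities: np.ndarray, n_buckets: int) -> np.ndarray:
    """Map each hand's win-probability estimate to a bucket id in [0, n_buckets)."""
    edges = np.quantile(hand_equities, np.linspace(0, 1, n_buckets + 1)[1:-1])
    return np.digitize(hand_equities, edges)

equities = np.random.rand(1326)       # e.g. one equity estimate per two-card starting hand
buckets = bucket_hands(equities, n_buckets=10)
print(np.bincount(buckets))           # roughly equal-sized groups of similar hands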
Abstraction
■ Requires substantial domain knowledge
■ A poorly designed abstraction can yield a strategy far from the original game's Nash equilibrium
→ Designing abstractions is a real burden
→ We would rather not design them at all!!
→ So what do we do??
■ Rely on deep learning!!
→ Enter Deep CFR, the topic of this talk
MONTE CARLO
COUNTERFACTUAL
REGRET MINIMIZATION
Monte Carlo counterfactual regret
minimization
■ Each iteration explores only a portion Q of the game tree
– Conventional CFR traverses the entire game tree every iteration
→ infeasible when the game tree is large
■ How the nodes explored per iteration are sampled:
– External sampling
■ Fully expands only one player's actions per iteration
■ The other players' actions are sampled from the strategy σ
→ also used in Deep CFR!!
– Outcome sampling
■ Every player's actions are sampled from the strategy σ each iteration
(Game-tree illustration: chance node C, player nodes P1 and P2, and terminal payoffs 100 and 25)
External Sampling
■ Because the tree is not fully traversed, the exact regret r^t(I, a) is unknown
■ Instead, approximate the regret by Monte Carlo using the sampled outcomes:
r̃_p^t(I, a) = (1 − σ^t(a|I)) Σ_{z ∈ Q ∩ Z_I} u_p(z) π_p^{σ^t}(z[I]·a, z)
– This matches r^t(I, a) in expectation
External Sampling MCCFR
■ Algorithm outline (a minimal tabular sketch follows below)
1. Let p be the player whose strategy is being updated (the traverser)
2. Traverse the game tree:
   a. Compute the strategy σ^t from the regrets of previous iterations
   b. At p's nodes, compute the regret r̃_p^t(I, a) for every action a
   c. At other players' nodes, sample an action from σ^t(I)
3. Repeat Steps 1 and 2 with each player as p
4. Compute the average strategy σ̄^T(I) from the strategies σ^t of all iterations
(Game-tree illustration: chance node C, player nodes P1 and P2, terminal payoffs 100, 25, 50, 120, and the resulting regrets −50, 20, and 75)
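A compact tabular sketch of one external-sampling traversal, assuming a hypothetical game API (is_terminal, payoff, current_player, infoset_key, actions, next); chance nodes are omitted for brevity. Deep CFR later replaces the regret table with a neural network.

import random
from collections import defaultdict

regret_sum = defaultdict(lambda: defaultdict(float))    # cumulative sampled regrets per infoset
strategy_sum = defaultdict(lambda: defaultdict(float))  # accumulator for the average strategy

def regret_matching(regrets, actions):
    positive = {a: max(regrets[a], 0.0) for a in actions}
    total = sum(positive.values())
    if total > 0:
        return {a: positive[a] / total for a in actions}
    return {a: 1.0 / len(actions) for a in actions}

def traverse(state, traverser):
    if state.is_terminal():
        return state.payoff(traverser)
    infoset, actions = state.infoset_key(), state.actions()
    sigma = regret_matching(regret_sum[infoset], actions)
    if state.current_player() == traverser:
        # Traverser's node: expand every action, then accumulate sampled regrets.
        values = {a: traverse(state.next(a), traverser) for a in actions}
        node_value = sum(sigma[a] * values[a] for a in actions)
        for a in actions:
            regret_sum[infoset][a] += values[a] - node_value
        return node_value
    # Opponent's node: record the current strategy and sample a single action.
    for a in actions:
        strategy_sum[infoset][a] += sigma[a]
    a = random.choices(actions, weights=[sigma[a] for a in actions])[0]
    return traverse(state.next(a), traverser)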
DEEP
COUNTERFACTUAL
REGRET MINIMIZATION
Deep CFR
■ Approximates the behavior of CFR with DNNs
■ Two quantities in CFR are approximated by DNNs:
– The regret R_p^t(I, a) is approximated by V(I, a|θ_p)
– The average strategy σ̄^T(I) is approximated by Π(I, a|θ_Π)
■ Think of it as External Sampling MCCFR with DNNs plugged into the algorithm
Differences from External Sampling MCCFR
■ Algorithm outline
1. Let p be the player whose strategy is being updated (the traverser)
2. Traverse the game tree:
   a. Compute the strategy σ^t from the regrets of previous iterations
   b. At p's nodes, compute the regret r̃(I, a) for every action a
   c. At other players' nodes, sample an action from σ^t(I)
3. Repeat Steps 1 and 2 with each player as the target
4. Compute the average strategy σ̄^T(I) from the strategies σ^t of all iterations
→ Approximated with DNNs!! (the regrets behind step 2a and the average strategy in step 4 are what the networks V and Π replace)
Training V(I, a|θ_p)
■ Approximates the regret r̃_p^t(I, a)
■ Used to determine the strategy σ^t(I)
– obtained by regret matching over the outputs of V
Training V(I, a|θ_p)
■ Loss function
E_{(I, t', r̃^{t'}) ~ M_{V,p}} [ t' Σ_a ( r̃^{t'}(a) − V(I, a|θ_p) )^2 ]
– M_{V,p} : a memory that stores player p's regrets
■ when capacity is exceeded, entries are overwritten by reservoir sampling (a sketch follows below)
– The weights are randomly re-initialized and trained from scratch at every iteration
※ The average of the regrets r̃^{t'}(I, a) over past iterations is proportional to the total regret R^t(I, a)
→ old data must also be kept around, hence reservoir sampling
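A minimal reservoir-sampling buffer, as one plausible way to implement the memories described above (the class and its interface are this writeup's own, not the paper's code): every sample ever inserted keeps an equal probability of remaining once the capacity is exceeded, which is exactly why old iterations stay represented.

import random

class ReservoirMemory:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.data = []
        self.n_seen = 0

    def add(self, sample):
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            # Replace a uniformly chosen slot with probability capacity / n_seen.
            j = random.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = sample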
Training Π(I, a|θ_Π)
■ Approximates the average strategy σ̄^T(I)
※ in CFR it is the average strategy that converges to a Nash equilibrium
■ Loss function (a code sketch of both losses follows below)
E_{(I, t', σ^{t'}) ~ M_Π} [ t' Σ_a ( σ^{t'}(a) − Π(I, a|θ_Π) )^2 ]
– M_Π : a memory that stores every player's strategies
■ when capacity is exceeded, entries are overwritten by reservoir sampling
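A sketch of the iteration-weighted MSE losses above in PyTorch; the network interface and batch layout are assumptions for illustration, not the paper's implementation. The same function covers both the V loss (targets = sampled regrets r̃) and the Π loss (targets = sampled strategies σ).

import torch

def weighted_mse_loss(net, infosets, iterations, targets):
    """L(theta) = E[ t' * sum_a (target(a) - net(I, a))^2 ] over a sampled batch.

    infosets:   (batch, feature_dim) encoded information sets
    iterations: (batch,) CFR iteration t' at which each sample was stored
    targets:    (batch, n_actions) sampled regrets (for V) or strategies (for Pi)
    """
    predictions = net(infosets)                       # (batch, n_actions)
    per_sample = ((targets - predictions) ** 2).sum(dim=1)
    return (iterations * per_sample).mean()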
Algorithm
Algorithm 1 Deep Counterfactual Regret Minimization
function DEEPCFR
    Initialize each player's advantage network V(I, a|θ_p) with parameters θ_p so that it returns 0 for all inputs.
    Initialize reservoir-sampled advantage memories M_{V,1}, M_{V,2} and strategy memory M_Π.
    for CFR iteration t = 1 to T do
        for each player p do
            for traversal k = 1 to K do
                TRAVERSE(∅, p, θ_1, θ_2, M_{V,p}, M_Π)    ▷ Collect data from a game traversal with external sampling
            Train θ_p from scratch on loss L(θ_p) = E_{(I, t', r̃^{t'}) ~ M_{V,p}} [ t' Σ_a ( r̃^{t'}(a) − V(I, a|θ_p) )^2 ]
    Train θ_Π on loss L(θ_Π) = E_{(I, t', σ^{t'}) ~ M_Π} [ t' Σ_a ( σ^{t'}(a) − Π(I, a|θ_Π) )^2 ]
    return θ_Π

Algorithm 2 CFR Traversal with External Sampling
function TRAVERSE(h, p, θ_1, θ_2, M_V, M_Π, t)
Input: History h, traverser player p, regret network parameters θ for each player, advantage memory M_V for player p, strategy memory M_Π, CFR iteration t.
    if h is terminal then
        return the payoff to player p
    else if h is a chance node then
        a ~ σ(h)
        return TRAVERSE(h·a, p, θ_1, θ_2, M_V, M_Π, t)
    else if P(h) = p then    ▷ If it's the traverser's turn to act
        Compute strategy σ^t(I) from predicted advantages V(I(h), a|θ_p) using regret matching.
        for a ∈ A(h) do
            v(a) ← TRAVERSE(h·a, p, θ_1, θ_2, M_V, M_Π, t)    ▷ Traverse each action
        for a ∈ A(h) do
            r̃(I, a) ← v(a) − Σ_{a'∈A(h)} σ(I, a')·v(a')    ▷ Compute advantages
        Insert the infoset and its action advantages (I, t, r̃^t(I)) into the advantage memory M_V
    else    ▷ If it's the opponent's turn to act
        Compute strategy σ^t(I) from predicted advantages V(I(h), a|θ_{3−p}) using regret matching.
        Insert the infoset and its action probabilities (I, t, σ^t(I)) into the strategy memory M_Π
        Sample an action a from the probability distribution σ^t(I).
        return TRAVERSE(h·a, p, θ_1, θ_2, M_V, M_Π, t)
Algorithm
(Same Algorithm 1 pseudocode as above, highlighting the TRAVERSE calls: traverse the game tree and collect data)
Algorithm
(Same pseudocode, highlighting the line that trains V(I, a|θ_p))
Algorithm
(Same pseudocode, highlighting the line that trains Π(I, a|θ_Π))
Algorithm
(Algorithm 2, the external-sampling traversal, shown in full; see the reconstruction above)
Algorithm
(Same Algorithm 2 pseudocode, highlighting: compute the strategy σ^t(I) from V(I, a|θ_p) by regret matching)
Algorithm
(Same Algorithm 2 pseudocode, highlighting: compute the regrets (advantages) and insert them into the advantage memory M_V)
Algorithm
(Same Algorithm 2 pseudocode, highlighting: compute the opponent's strategy and insert it into the strategy memory M_Π; a condensed Python sketch of TRAVERSE follows below)
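Putting the walkthrough together, a condensed Python sketch of Algorithm 2, assuming the same hypothetical game API as the MCCFR sketch earlier plus advantage networks v_nets[p] whose predict(infoset) returns a dict of per-action advantages; following the ablation discussion later, all probability mass goes to the highest-advantage action when every advantage is non-positive.

import random

def strategy_from_advantages(advantages, actions):
    positive = {a: max(advantages[a], 0.0) for a in actions}
    total = sum(positive.values())
    if total > 0:
        return {a: positive[a] / total for a in actions}
    best = max(actions, key=lambda a: advantages[a])   # all mass on the highest-advantage
    return {a: float(a == best) for a in actions}      # action when every advantage <= 0

def traverse(state, p, v_nets, adv_memory, strat_memory, t):
    if state.is_terminal():
        return state.payoff(p)
    if state.is_chance():
        return traverse(state.next(state.sample_chance()), p, v_nets, adv_memory, strat_memory, t)
    infoset, actions = state.infoset_key(), state.actions()
    acting = state.current_player()
    advantages = v_nets[acting].predict(infoset)        # V(I, a | theta_acting)
    sigma = strategy_from_advantages(advantages, actions)
    if acting == p:
        # Traverser's node: expand every action and store sampled advantages.
        values = {a: traverse(state.next(a), p, v_nets, adv_memory, strat_memory, t) for a in actions}
        baseline = sum(sigma[a] * values[a] for a in actions)
        adv_memory.add((infoset, t, {a: values[a] - baseline for a in actions}))
        return baseline
    # Opponent's node: store the strategy sample and follow one sampled action.
    strat_memory.add((infoset, t, sigma))
    a = random.choices(actions, weights=[sigma[a] for a in actions])[0]
    return traverse(state.next(a), p, v_nets, adv_memory, strat_memory, t)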
ANALYSIS
Proposition
■ Let T be the number of Deep CFR iterations, |A| the maximum number of actions at any information set, and K the number of game-tree traversals per iteration.
If the memory M_{V,p} is sufficiently large, then with probability 1 − ρ the regret R_p^T satisfies
R_p^T ≤ ( 1 + √(2 / (ρK)) ) Δ |I_p| √(|A| T) + 4T |I_p| √(|A| Δ ε_L)
(here Δ is the payoff range and |I_p| the number of player p's information sets; L_V^t denotes the MSE loss of V(I, a|θ_p^t), L_{V*}^t the minimum achievable MSE loss, and ε_L satisfies L_V^t − L_{V*}^t ≤ ε_L)
→ As T → ∞, the average regret R_p^T / T is bounded by 4 |I_p| √(|A| Δ ε_L)
Proof
■ Omitted
EXPERIMENT
Setting
■ Heads-up flop hold'em (FHP)
– about 10^12 nodes, 10^9 information sets
– A version of heads-up limit Texas hold'em that ends after 2 rounds
– Three community cards
■ Heads-up limit Texas hold'em (HULH)
– about 10^17 nodes, 10^14 information sets
■ Baselines
– Neural Fictitious Self Play (NFSP)
– Abstraction + CFR
■ with several different levels of abstraction coarseness
■ Lossless : an abstraction that merges strategically isomorphic situations
Network Architecture
■ The same architecture is used for both V and Π
■ Card ranks and suits are embedded and fed as input
(Excerpt from the paper) Figure 1. The neural network architecture used for Deep CFR. The network takes an infoset (observed cards and bet history) as input and outputs values (advantages or probability logits) for each possible action. The inputs include rank, suit (1-4), and card (1-52) embeddings. (A rough embedding sketch follows below.)
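A rough PyTorch sketch of the card-embedding input described in the figure caption (dimensions and details are assumptions, not the paper's exact architecture): each card is embedded by rank, suit, and card identity, and the per-card embeddings are summed.

import torch
import torch.nn as nn

class CardEmbedding(nn.Module):
    def __init__(self, dim: int = 64):
        super().__init__()
        self.rank = nn.Embedding(13, dim)  # 2..A
        self.suit = nn.Embedding(4, dim)   # clubs..spades
        self.card = nn.Embedding(52, dim)  # card identity

    def forward(self, cards: torch.Tensor) -> torch.Tensor:
        """cards: (batch, n_cards) integer ids in [0, 52)."""
        ranks, suits = cards // 4, cards % 4
        return (self.card(cards) + self.rank(ranks) + self.suit(suits)).sum(dim=1)

emb = CardEmbedding()
hole_cards = torch.tensor([[12, 25]])   # a hypothetical two-card hand
print(emb(hole_cards).shape)            # torch.Size([1, 64])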
Results : FHP
(Excerpt from the paper) For reference, professional human competitors lost to the 2007 Polaris AI by about 52 ± 10 mbb/g (Johanson, 2016). In contrast, the Deep CFR agent loses to a 3.3·10^8-bucket abstraction by only 11 ± 2 mbb/g and beats NFSP by 43 ± 2 mbb/g.
Figure 2. Comparison of Deep CFR with domain-specific tabular abstractions and NFSP in FHP (legend: Deep CFR; NFSP with 1,000 and 10,000 infosets per update; abstractions with 40,000 / 368,000 / 3,644,000 clusters; lossless abstraction with 2,340 clusters). Coarser abstractions converge faster but are more exploitable.
■ Deep CFR converges faster than NFSP
■ Deep CFR spends more of its runtime on SGD steps, so in total wallclock time it is not dramatically faster than NFSP
Results : FHP
■ Reducing the number of game-tree traversals per iteration slows convergence
■ The number of SGD steps per iteration affects the exploitability the algorithm converges to
■ Increasing the number of hidden-layer units decreases exploitability
(Excerpt from the paper) Figure 3. Left: FHP convergence for different numbers of training data collection traversals per simulated LCFR iteration; the dotted line shows the performance of vanilla tabular Linear CFR without abstraction or sampling. Middle: FHP convergence using different numbers of minibatch SGD updates to train the advantage model at each LCFR iteration. Right: Exploitability of Deep CFR in FHP for different model sizes; the label indicates the dimension (number of features) in each hidden layer of the model.
(Table 1, the HULH head-to-head results, also appears on these slides; it is reproduced in the Results : HULH section below.)
Results : FHP
(Excerpt from the paper) Figure 4. Ablations of Deep CFR components in FHP. Left: 5 replicates of Deep CFR as a baseline (standard deviation of exploitability at t = 450 is 2.25 mbb/g); Deep CFR without linear weighting converges more slowly; fine-tuning the same network at each CFR iteration instead of retraining from scratch gives about 50% higher final exploitability; playing a uniform strategy when all regrets are negative (i.e. standard regret matching), rather than the highest-regret action, also gives about 50% higher final exploitability. Right: Deep CFR with sliding-window memories instead of reservoir sampling.
■ Fine-tuning V at each iteration (instead of retraining from scratch) worsens exploitability
■ Switching the memories from reservoir sampling to a sliding window makes exploitability unstable
Results : HULH
■ Deep CFR beats NFSP by a wide margin
■ It loses slightly as the abstraction is made finer (less coarse)
(Excerpt from the paper) Table 1. Head-to-head expected value of NFSP and Deep CFR in HULH against converged CFR equilibria with varying abstraction sizes. For comparison, in 2007 an AI using abstractions of roughly 3·10^8 buckets defeated human professionals by about 52 mbb/g (after variance reduction techniques were applied).
Opponent model:   NFSP            Deep CFR        3.3·10^6 abstraction   3.3·10^7 abstraction   3.3·10^8 abstraction
NFSP              -               -43 ± 2 mbb/g   -40 ± 2 mbb/g          -49 ± 2 mbb/g          -55 ± 2 mbb/g
Deep CFR          +43 ± 2 mbb/g   -               +6 ± 2 mbb/g           -6 ± 2 mbb/g           -11 ± 2 mbb/g
Summary and impressions
■ Proposes Deep CFR, an algorithm that incorporates DNNs into CFR
■ Removes the need for abstraction when tackling huge games
■ By making abstraction unnecessary, it makes CFR far more general-purpose
■ I would like to implement it and experiment on a simple game
More Related Content

What's hot

強化学習における好奇心
強化学習における好奇心強化学習における好奇心
強化学習における好奇心Shota Imai
 
強化学習アルゴリズムPPOの解説と実験
強化学習アルゴリズムPPOの解説と実験強化学習アルゴリズムPPOの解説と実験
強化学習アルゴリズムPPOの解説と実験克海 納谷
 
報酬設計と逆強化学習
報酬設計と逆強化学習報酬設計と逆強化学習
報酬設計と逆強化学習Yusuke Nakata
 
二人零和マルコフゲームにおけるオフ方策評価
二人零和マルコフゲームにおけるオフ方策評価二人零和マルコフゲームにおけるオフ方策評価
二人零和マルコフゲームにおけるオフ方策評価Kenshi Abe
 
【DL輪読会】マルチエージェント強化学習における近年の 協調的方策学習アルゴリズムの発展
【DL輪読会】マルチエージェント強化学習における近年の 協調的方策学習アルゴリズムの発展【DL輪読会】マルチエージェント強化学習における近年の 協調的方策学習アルゴリズムの発展
【DL輪読会】マルチエージェント強化学習における近年の 協調的方策学習アルゴリズムの発展Deep Learning JP
 
[Dl輪読会]introduction of reinforcement learning
[Dl輪読会]introduction of reinforcement learning[Dl輪読会]introduction of reinforcement learning
[Dl輪読会]introduction of reinforcement learningDeep Learning JP
 
[DL輪読会]Reward Augmented Maximum Likelihood for Neural Structured Prediction
[DL輪読会]Reward Augmented Maximum Likelihood for Neural Structured Prediction[DL輪読会]Reward Augmented Maximum Likelihood for Neural Structured Prediction
[DL輪読会]Reward Augmented Maximum Likelihood for Neural Structured PredictionDeep Learning JP
 
PILCO - 第一回高橋研究室モデルベース強化学習勉強会
PILCO - 第一回高橋研究室モデルベース強化学習勉強会PILCO - 第一回高橋研究室モデルベース強化学習勉強会
PILCO - 第一回高橋研究室モデルベース強化学習勉強会Shunichi Sekiguchi
 
強化学習の基礎と深層強化学習(東京大学 松尾研究室 深層強化学習サマースクール講義資料)
強化学習の基礎と深層強化学習(東京大学 松尾研究室 深層強化学習サマースクール講義資料)強化学習の基礎と深層強化学習(東京大学 松尾研究室 深層強化学習サマースクール講義資料)
強化学習の基礎と深層強化学習(東京大学 松尾研究室 深層強化学習サマースクール講義資料)Shota Imai
 
「世界モデル」と関連研究について
「世界モデル」と関連研究について「世界モデル」と関連研究について
「世界モデル」と関連研究についてMasahiro Suzuki
 
古典的ゲームAIを用いたAlphaGo解説
古典的ゲームAIを用いたAlphaGo解説古典的ゲームAIを用いたAlphaGo解説
古典的ゲームAIを用いたAlphaGo解説suckgeun lee
 
[DL輪読会] マルチエージェント強化学習と心の理論
[DL輪読会] マルチエージェント強化学習と心の理論[DL輪読会] マルチエージェント強化学習と心の理論
[DL輪読会] マルチエージェント強化学習と心の理論Deep Learning JP
 
[DL輪読会]Control as Inferenceと発展
[DL輪読会]Control as Inferenceと発展[DL輪読会]Control as Inferenceと発展
[DL輪読会]Control as Inferenceと発展Deep Learning JP
 
方策勾配型強化学習の基礎と応用
方策勾配型強化学習の基礎と応用方策勾配型強化学習の基礎と応用
方策勾配型強化学習の基礎と応用Ryo Iwaki
 
POMDP下での強化学習の基礎と応用
POMDP下での強化学習の基礎と応用POMDP下での強化学習の基礎と応用
POMDP下での強化学習の基礎と応用Yasunori Ozaki
 
[DL輪読会]相互情報量最大化による表現学習
[DL輪読会]相互情報量最大化による表現学習[DL輪読会]相互情報量最大化による表現学習
[DL輪読会]相互情報量最大化による表現学習Deep Learning JP
 
ゼロから始める深層強化学習(NLP2018講演資料)/ Introduction of Deep Reinforcement Learning
ゼロから始める深層強化学習(NLP2018講演資料)/ Introduction of Deep Reinforcement Learningゼロから始める深層強化学習(NLP2018講演資料)/ Introduction of Deep Reinforcement Learning
ゼロから始める深層強化学習(NLP2018講演資料)/ Introduction of Deep Reinforcement LearningPreferred Networks
 
深層強化学習の分散化・RNN利用の動向〜R2D2の紹介をもとに〜
深層強化学習の分散化・RNN利用の動向〜R2D2の紹介をもとに〜深層強化学習の分散化・RNN利用の動向〜R2D2の紹介をもとに〜
深層強化学習の分散化・RNN利用の動向〜R2D2の紹介をもとに〜Jun Okumura
 
不均衡データのクラス分類
不均衡データのクラス分類不均衡データのクラス分類
不均衡データのクラス分類Shintaro Fukushima
 
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3Preferred Networks
 

What's hot (20)

強化学習における好奇心
強化学習における好奇心強化学習における好奇心
強化学習における好奇心
 
強化学習アルゴリズムPPOの解説と実験
強化学習アルゴリズムPPOの解説と実験強化学習アルゴリズムPPOの解説と実験
強化学習アルゴリズムPPOの解説と実験
 
報酬設計と逆強化学習
報酬設計と逆強化学習報酬設計と逆強化学習
報酬設計と逆強化学習
 
二人零和マルコフゲームにおけるオフ方策評価
二人零和マルコフゲームにおけるオフ方策評価二人零和マルコフゲームにおけるオフ方策評価
二人零和マルコフゲームにおけるオフ方策評価
 
【DL輪読会】マルチエージェント強化学習における近年の 協調的方策学習アルゴリズムの発展
【DL輪読会】マルチエージェント強化学習における近年の 協調的方策学習アルゴリズムの発展【DL輪読会】マルチエージェント強化学習における近年の 協調的方策学習アルゴリズムの発展
【DL輪読会】マルチエージェント強化学習における近年の 協調的方策学習アルゴリズムの発展
 
[Dl輪読会]introduction of reinforcement learning
[Dl輪読会]introduction of reinforcement learning[Dl輪読会]introduction of reinforcement learning
[Dl輪読会]introduction of reinforcement learning
 
[DL輪読会]Reward Augmented Maximum Likelihood for Neural Structured Prediction
[DL輪読会]Reward Augmented Maximum Likelihood for Neural Structured Prediction[DL輪読会]Reward Augmented Maximum Likelihood for Neural Structured Prediction
[DL輪読会]Reward Augmented Maximum Likelihood for Neural Structured Prediction
 
PILCO - 第一回高橋研究室モデルベース強化学習勉強会
PILCO - 第一回高橋研究室モデルベース強化学習勉強会PILCO - 第一回高橋研究室モデルベース強化学習勉強会
PILCO - 第一回高橋研究室モデルベース強化学習勉強会
 
強化学習の基礎と深層強化学習(東京大学 松尾研究室 深層強化学習サマースクール講義資料)
強化学習の基礎と深層強化学習(東京大学 松尾研究室 深層強化学習サマースクール講義資料)強化学習の基礎と深層強化学習(東京大学 松尾研究室 深層強化学習サマースクール講義資料)
強化学習の基礎と深層強化学習(東京大学 松尾研究室 深層強化学習サマースクール講義資料)
 
「世界モデル」と関連研究について
「世界モデル」と関連研究について「世界モデル」と関連研究について
「世界モデル」と関連研究について
 
古典的ゲームAIを用いたAlphaGo解説
古典的ゲームAIを用いたAlphaGo解説古典的ゲームAIを用いたAlphaGo解説
古典的ゲームAIを用いたAlphaGo解説
 
[DL輪読会] マルチエージェント強化学習と心の理論
[DL輪読会] マルチエージェント強化学習と心の理論[DL輪読会] マルチエージェント強化学習と心の理論
[DL輪読会] マルチエージェント強化学習と心の理論
 
[DL輪読会]Control as Inferenceと発展
[DL輪読会]Control as Inferenceと発展[DL輪読会]Control as Inferenceと発展
[DL輪読会]Control as Inferenceと発展
 
方策勾配型強化学習の基礎と応用
方策勾配型強化学習の基礎と応用方策勾配型強化学習の基礎と応用
方策勾配型強化学習の基礎と応用
 
POMDP下での強化学習の基礎と応用
POMDP下での強化学習の基礎と応用POMDP下での強化学習の基礎と応用
POMDP下での強化学習の基礎と応用
 
[DL輪読会]相互情報量最大化による表現学習
[DL輪読会]相互情報量最大化による表現学習[DL輪読会]相互情報量最大化による表現学習
[DL輪読会]相互情報量最大化による表現学習
 
ゼロから始める深層強化学習(NLP2018講演資料)/ Introduction of Deep Reinforcement Learning
ゼロから始める深層強化学習(NLP2018講演資料)/ Introduction of Deep Reinforcement Learningゼロから始める深層強化学習(NLP2018講演資料)/ Introduction of Deep Reinforcement Learning
ゼロから始める深層強化学習(NLP2018講演資料)/ Introduction of Deep Reinforcement Learning
 
深層強化学習の分散化・RNN利用の動向〜R2D2の紹介をもとに〜
深層強化学習の分散化・RNN利用の動向〜R2D2の紹介をもとに〜深層強化学習の分散化・RNN利用の動向〜R2D2の紹介をもとに〜
深層強化学習の分散化・RNN利用の動向〜R2D2の紹介をもとに〜
 
不均衡データのクラス分類
不均衡データのクラス分類不均衡データのクラス分類
不均衡データのクラス分類
 
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
Optunaを使ったHuman-in-the-loop最適化の紹介 - 2023/04/27 W&B 東京ミートアップ #3
 

Similar to Deep Counterfactual Regret Minimization

Sampling method : MCMC
Sampling method : MCMCSampling method : MCMC
Sampling method : MCMCSEMINARGROOT
 
Pricing optimization poster version 2 (1)
Pricing optimization poster version 2 (1)Pricing optimization poster version 2 (1)
Pricing optimization poster version 2 (1)Alex Potocki
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - EnglishKohei KaiGai
 
ICML2013読み会 Large-Scale Learning with Less RAM via Randomization
ICML2013読み会 Large-Scale Learning with Less RAM via RandomizationICML2013読み会 Large-Scale Learning with Less RAM via Randomization
ICML2013読み会 Large-Scale Learning with Less RAM via RandomizationHidekazu Oiwa
 
第19回ステアラボ人工知能セミナー発表資料
第19回ステアラボ人工知能セミナー発表資料第19回ステアラボ人工知能セミナー発表資料
第19回ステアラボ人工知能セミナー発表資料Takayuki Osogami
 
Gossip-based resource allocation for green computing in large clouds
Gossip-based resource allocation for green computing in large cloudsGossip-based resource allocation for green computing in large clouds
Gossip-based resource allocation for green computing in large cloudsRerngvit Yanggratoke
 
GraphBLAS: A linear algebraic approach for high-performance graph queries
GraphBLAS: A linear algebraic approach for high-performance graph queriesGraphBLAS: A linear algebraic approach for high-performance graph queries
GraphBLAS: A linear algebraic approach for high-performance graph queriesGábor Szárnyas
 
Lecture 1
Lecture 1Lecture 1
Lecture 1butest
 
Python for Scientific Computing -- Ricardo Cruz
Python for Scientific Computing -- Ricardo CruzPython for Scientific Computing -- Ricardo Cruz
Python for Scientific Computing -- Ricardo Cruzrpmcruz
 
06 Recursion in C.pptx
06 Recursion in C.pptx06 Recursion in C.pptx
06 Recursion in C.pptxMouDhara1
 
Intro to Quantitative Investment (Lecture 1 of 6)
Intro to Quantitative Investment (Lecture 1 of 6)Intro to Quantitative Investment (Lecture 1 of 6)
Intro to Quantitative Investment (Lecture 1 of 6)Adrian Aley
 
Mastering the game of go with deep neural networks and tree searching
Mastering the game of go with deep neural networks and tree searchingMastering the game of go with deep neural networks and tree searching
Mastering the game of go with deep neural networks and tree searchingBrian Kim
 
Introduction of Xgboost
Introduction of XgboostIntroduction of Xgboost
Introduction of Xgboostmichiaki ito
 
Higher-order Factorization Machines(第5回ステアラボ人工知能セミナー)
Higher-order Factorization Machines(第5回ステアラボ人工知能セミナー)Higher-order Factorization Machines(第5回ステアラボ人工知能セミナー)
Higher-order Factorization Machines(第5回ステアラボ人工知能セミナー)STAIR Lab, Chiba Institute of Technology
 

Similar to Deep Counterfactual Regret Minimization (20)

Sampling method : MCMC
Sampling method : MCMCSampling method : MCMC
Sampling method : MCMC
 
Pricing optimization poster version 2 (1)
Pricing optimization poster version 2 (1)Pricing optimization poster version 2 (1)
Pricing optimization poster version 2 (1)
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English
 
Apsipa2016for ss
Apsipa2016for ssApsipa2016for ss
Apsipa2016for ss
 
ICML2013読み会 Large-Scale Learning with Less RAM via Randomization
ICML2013読み会 Large-Scale Learning with Less RAM via RandomizationICML2013読み会 Large-Scale Learning with Less RAM via Randomization
ICML2013読み会 Large-Scale Learning with Less RAM via Randomization
 
第19回ステアラボ人工知能セミナー発表資料
第19回ステアラボ人工知能セミナー発表資料第19回ステアラボ人工知能セミナー発表資料
第19回ステアラボ人工知能セミナー発表資料
 
Gossip-based resource allocation for green computing in large clouds
Gossip-based resource allocation for green computing in large cloudsGossip-based resource allocation for green computing in large clouds
Gossip-based resource allocation for green computing in large clouds
 
2_1 Edit Distance.pptx
2_1 Edit Distance.pptx2_1 Edit Distance.pptx
2_1 Edit Distance.pptx
 
GraphBLAS: A linear algebraic approach for high-performance graph queries
GraphBLAS: A linear algebraic approach for high-performance graph queriesGraphBLAS: A linear algebraic approach for high-performance graph queries
GraphBLAS: A linear algebraic approach for high-performance graph queries
 
Lecture 1
Lecture 1Lecture 1
Lecture 1
 
Python for Scientific Computing -- Ricardo Cruz
Python for Scientific Computing -- Ricardo CruzPython for Scientific Computing -- Ricardo Cruz
Python for Scientific Computing -- Ricardo Cruz
 
Macro
MacroMacro
Macro
 
06 Recursion in C.pptx
06 Recursion in C.pptx06 Recursion in C.pptx
06 Recursion in C.pptx
 
Intro to Quantitative Investment (Lecture 1 of 6)
Intro to Quantitative Investment (Lecture 1 of 6)Intro to Quantitative Investment (Lecture 1 of 6)
Intro to Quantitative Investment (Lecture 1 of 6)
 
Mastering the game of go with deep neural networks and tree searching
Mastering the game of go with deep neural networks and tree searchingMastering the game of go with deep neural networks and tree searching
Mastering the game of go with deep neural networks and tree searching
 
CoSECiVi 2020 - Parametric Action Pre-Selection for MCTS in Real-Time Strateg...
CoSECiVi 2020 - Parametric Action Pre-Selection for MCTS in Real-Time Strateg...CoSECiVi 2020 - Parametric Action Pre-Selection for MCTS in Real-Time Strateg...
CoSECiVi 2020 - Parametric Action Pre-Selection for MCTS in Real-Time Strateg...
 
Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...Sequential and reinforcement learning for demand side management by Margaux B...
Sequential and reinforcement learning for demand side management by Margaux B...
 
Introduction of Xgboost
Introduction of XgboostIntroduction of Xgboost
Introduction of Xgboost
 
Higher-order Factorization Machines(第5回ステアラボ人工知能セミナー)
Higher-order Factorization Machines(第5回ステアラボ人工知能セミナー)Higher-order Factorization Machines(第5回ステアラボ人工知能セミナー)
Higher-order Factorization Machines(第5回ステアラボ人工知能セミナー)
 
Calculus-level Coding Overview
Calculus-level Coding OverviewCalculus-level Coding Overview
Calculus-level Coding Overview
 

Recently uploaded

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...amitlee9823
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Researchmichael115558
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Valters Lauzums
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxolyaivanovalion
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxolyaivanovalion
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz1
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangaloreamitlee9823
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxolyaivanovalion
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysismanisha194592
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxolyaivanovalion
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxolyaivanovalion
 

Recently uploaded (20)

Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
Junnasandra Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Anomaly detection and data imputation within time series
Anomaly detection and data imputation within time seriesAnomaly detection and data imputation within time series
Anomaly detection and data imputation within time series
 
Predicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science ProjectPredicting Loan Approval: A Data Science Project
Predicting Loan Approval: A Data Science Project
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
CebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptxCebaBaby dropshipping via API with DroFX.pptx
CebaBaby dropshipping via API with DroFX.pptx
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Smarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptxSmarteg dropshipping via API with DroFx.pptx
Smarteg dropshipping via API with DroFx.pptx
 
Invezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signalsInvezz.com - Grow your wealth with trading signals
Invezz.com - Grow your wealth with trading signals
 
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service BangaloreCall Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
Call Girls Begur Just Call 👗 7737669865 👗 Top Class Call Girl Service Bangalore
 
ELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptxELKO dropshipping via API with DroFx.pptx
ELKO dropshipping via API with DroFx.pptx
 
April 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's AnalysisApril 2024 - Crypto Market Report's Analysis
April 2024 - Crypto Market Report's Analysis
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 
Midocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFxMidocean dropshipping via API with DroFx
Midocean dropshipping via API with DroFx
 

Deep Counterfactual Regret Minimization

  • 1. DEEP COUNTERFACTUAL REGRET MINIMIZATION AI Lab 阿部拳之 2019/09/03
  • 2. Summary ■ Noam Brown, Adam Lerer, Sam Gross, Tuomas Sandholm ■ ICML 2019 ■ Contribution – DNNをCFRに組み込んだアルゴリズム Deep CFRを提案 – 巨⼤なゲームを扱うためのAbstractionの必要性を排除
  • 4. Notation ■ ℎ : 全プレイヤの⾏動履歴 or ノード ■ 𝐼 : 情報集合 ■ 𝐴(ℎ) : ℎで取ることができる⾏動の集合 ■ 𝑃(ℎ) : ℎで⾏動を取るプレイヤ ■ 𝑍 : 終端ノードの集合 ■ 𝑢) ∶ 𝑍 → 𝑅 : 終端ノードにおける利得関数 ■ 𝑍- : 𝐼から到達することができる終端ノードの集合 ■ 𝑧 𝐼 : 𝑧 ∈ 𝑍へと到達することができる𝐼中のノード
  • 5. Notation ■ 𝜎(𝐼) : 𝐼における戦略 ■ 𝜎)(𝐼) : プレイヤ𝑝の戦略 ■ 𝜎2)(𝐼) : 𝑝以外のプレイヤの戦略 ■ 𝜋4(ℎ) : 𝜎に従ったときにℎに到達する確率 ■ 𝜋) 4(ℎ) : プレイヤ𝑝のcontribution ■ 𝜋2) 4 (ℎ) : 𝑝以外のプレイヤのcontribution
  • 6. ⼆⼈零和不完全情報ゲームにおける 最適な戦略 ■ ナッシュ均衡 – どのプレイヤも⾃分の戦略を変更することによって,より⾼い利得を得るこ とができないような戦略の組み合わせ ■ 直感的には,それぞれのプレイヤが⾃分だけ戦略を変えることのメリッ トがない状況 – 各プレイヤ𝑝の戦略を𝜎),利得を𝑢)とすると,ナッシュ均衡戦略の組み合わ せ𝜎∗ は以下の式を満たす ∀𝑝, 𝑢) 𝜎) ∗, 𝜎2) ∗ = max 4< = ∈>< 𝑢)(𝜎) ? , 𝜎2) ∗ ) ■ ⼆⼈零和不完全情報ゲームでは,ナッシュ均衡戦略を最適戦略とすることが多い 𝑢@ 𝜎@ ∗ , 𝜎A ∗ = max 4B =∈>B 𝑢@(𝜎@ ? , 𝜎A ∗ ) , 𝑢A 𝜎@ ∗ , 𝜎A ∗ = max 4C =∈>C 𝑢A(𝜎@ ∗ , 𝜎A ? )
  • 7. ナッシュ均衡戦略の計算 ■ ゲームの状態数が⼤きい場合,厳密なナッシュ均衡戦略の計算は難しい... – ポーカーで⾔うと・・・ – Heads up Limit Texas Hold’em : 約10@F – Heads up No-limit Texas Hold’em : 約10@GH ■ できる限りナッシュ均衡戦略に近い戦略を求める – 代表的なアルゴリズム︓Counterfactual Regret Minimization (CFR)
  • 8. Counterfactual Regret Minimization ■ 基本的な流れ – 「この⼿を取ったほうが良かった」という後悔 (regret) を元に戦略を更新す ることで,(平均)戦略をナッシュ均衡へ近づけていく ■ Counterfactual Regret 𝑟) J 𝐼, 𝑎 = 𝜋2) 4L 𝐼 (𝑢) 𝜎-→M J , 𝐼 − 𝑢) 𝜎J , 𝐼 ) – 「現在の戦略𝜎Jを情報集合𝐼において⾏動𝑎を取るように変更したら,どれ だけ利得が増えるのか」を定式化 A B B B グー チョキ パー 情報集合𝐼 Regret Minimization in Games with Incomplete Information (2008) https://papers.nips.cc/paper/3306-regret-minimization-in-games-with-incomplete-information
  • 9. Counterfactual Regret Minimization ■ 基本的な流れ – 「この⼿を取ったほうが良かった」という後悔 (regret) を元に戦略を更新す ることで,(平均)戦略をナッシュ均衡へ近づけていく ■ Counterfactual Regret 𝑟) J 𝐼, 𝑎 = 𝜋2) 4L 𝐼 (𝑢) 𝜎-→M J , 𝐼 − 𝑢) 𝜎J , 𝐼 ) – 「現在の戦略𝜎Jを情報集合𝐼において⾏動𝑎を取るように変更したら,どれ だけ利得が増えるのか」を定式化 A B B B グー チョキ パー グーを出すようにしたら, どれくらい得かな...
  • 10. Counterfactual Regret Minimization ■ Cumulative Counterfactual Regret 𝑅) O 𝐼, 𝑎 = P JQ@ O 𝑟) J (𝐼, 𝑎) ■ Cumulative Counterfactual Regretに基づいて,戦略を更新 𝜎) OR@ 𝐼, 𝑎 = max(𝑅) O 𝐼, 𝑎 , 0) ∑M∈T(-) max(𝑅) O 𝐼, 𝑎 , 0) , 𝑖𝑓 P M∈T(-) max(𝑅) O 𝐼, 𝑎 , 0) > 0 1 |𝐴 𝐼 | , 𝑜𝑡ℎ𝑒𝑟𝑤𝑖𝑠𝑒 ■ 𝑡 = 1, ⋯ , 𝑇までの戦略𝜎) Jを平均化した戦略がナッシュ均衡戦略へと収束 – ただし,⼆⼈零和不完全情報ゲームのみ保証 – 三⼈以上では収束性の証明はされていない
  • 11. Counterfactual Regret Minimization ■ アルゴリズム概要 1. ある情報集合𝐼に対して,以下の2つの価値を計算 今の戦略の期待利得︓ 𝑢) 𝜎J , 𝐼 ”𝐼において⾏動𝑎を取るように変えた戦略”の期待 利得︓𝑢) 𝜎-→M J , 𝐼 2. 𝑢) 𝜎J, 𝐼 , 𝑢) 𝜎-→M J , 𝐼 を元に,regretを計算 3. Regretを元に,戦略を更新 A B B B 今の戦略の期待利得 : 𝑢) 𝜎J , 𝐼 𝑎 𝑏 𝑐
  • 12. Counterfactual Regret Minimization ■ アルゴリズム概要 1. ある情報集合𝐼に対して,以下の2つの価値を計算 今の戦略の期待利得︓ 𝑢) 𝜎J , 𝐼 ”𝐼において⾏動𝑎を取るように変えた戦略”の期待 利得︓𝑢) 𝜎-→M J , 𝐼 2. 𝑢) 𝜎J, 𝐼 , 𝑢) 𝜎-→M J , 𝐼 を元に,regretを計算 3. Regretを元に,戦略を更新 A B B B 今の戦略の期待利得 : 𝑢) 𝜎J , 𝐼 𝑎 𝑏 𝑐 𝑅) O 𝐼, 𝑎 = 2 𝑅) O 𝐼, 𝑏 = 0 𝑅) O 𝐼, 𝑐 = 1
  • 13. Counterfactual Regret Minimization ■ アルゴリズム概要 1. ある情報集合𝐼に対して,以下の2つの価値を計算 今の戦略の期待利得︓ 𝑢) 𝜎J , 𝐼 ”𝐼において⾏動𝑎を取るように変えた戦略”の期待 利得︓𝑢) 𝜎-→M J , 𝐼 2. 𝑢) 𝜎J, 𝐼 , 𝑢) 𝜎-→M J , 𝐼 を元に,regretを計算 3. Regretを元に,戦略を更新 A B B B 今の戦略の期待利得 : 𝑢) 𝜎J , 𝐼 𝑎 𝑏 𝑐 𝑅) O 𝐼, 𝑎 = 2 𝑅) O 𝐼, 𝑏 = 0 𝑅) O 𝐼, 𝑐 = 1 𝜎) OR@(𝐼)を(2/3, 0, 1/3)に更新
  • 14. Counterfactual Regret Minimization ■ さらに⾼速化したアルゴリズムCFR+によって,10@F 程度の情報集合数を持つ ゲームにおいて収束できるように ■ Heads up Limit Texas Hold’em(情報集合数︓約10@F)はほぼ攻略︕︕ ■ しかしHeads up No-limit Texas Hold’em(情報集合数︓約10@GH)はまだまだ遠 い... →戦略ベクトルをPCに保持できるような容量ですらないので,そもそも戦略を 求めようとすることが現実的ではない →ゲームを抽象化 (abstraction) して,情報集合数を現実的なサイズまで減らす Solving Heads-up Limit Texas Hold’em (2015) https://poker.cs.ualberta.ca/publications/2015-ijcai-cfrplus.pdf
  • 15. Abstraction ■ ゲームを抽象化することによって情報集合数を削減 ■ Domainに関する知識を使って設計するのが基本 ■ ポーカーにおけるabstractionの例 – ベッドできる⾦額を制限する(action abstraction) – 同じような強さの⼿札は同⼀のグループとして扱う(information abstraction) 1. Abstractionを⽤いてゲームを抽象化 2. CFRなどで抽象化したゲームのナッシュ均衡戦略を計算 3. 実際のプレイ時に抽象化ゲームのナッシュ均衡戦略を⽤いる というのが主流に
  • 18. Monte Carlo counterfactual regret minimization ■ Iterationごとにゲーム⽊の⼀部分𝑄のみを探索 – 従来のCFR : Iterationごとにゲーム⽊を全探索 →ゲーム⽊が⼤きい場合,実⾏が不可能 ■ Iterationごとに探索するノードのサンプリング⽅法 – External sampling ■ Iterationごとに⼀⼈の⾏動のみ全探索 ■ それ以外のプレイヤの⾏動は戦略𝜎からサンプリング →Deep CFRでも使⽤︕︕ – Outcome sampling ■ Iterationごとに全プレイヤの⾏動を戦略𝜎? からサンプリング C 𝑃@ 𝑃A 𝑃A 𝑃@ C 100 25
  • 19. External Sampling ■ 全探索しないため,正確なregret 𝑟J(𝐼, 𝑎)はわからない ■ そこで,サンプリングした結果を使ってregretをモンテカルロ近似 ̃𝑟) J 𝐼, 𝑎 = (1 − 𝜎J 𝑎 𝐼 ) P g∈h∩jk 𝑢) 𝑧 𝜋) 4L (𝑧 𝐼 𝑎, 𝑧) – 期待値的に𝑟J(𝐼, 𝑎)と⼀致
  • 20. External Sampling MCCFR ■ アルゴリズム概要 1. 戦略を更新するプレイヤ (traverser) を𝑝とする 2. ゲーム⽊を探索 a. これまでのiterationのregretから,戦略𝜎Jを計算 b. プレイヤが𝑝であれば,各⾏動𝑎に対してregret ̃𝑟) J 𝐼, 𝑎 を計算 c. それ以外であれば,⾏動を𝜎J(𝐼)からサンプル 3. Step 1とStep 2を各プレイヤを𝑝として繰り返す 4. これまでのiterationの戦略𝜎Jから,平均戦略 l𝜎O(𝐼) を計算 C 𝑃@ 𝑃A 𝑃A 𝑃@ C 100 2550 120 Regret= -50 Regret= 20 Regret = 75
  • 22. Deep CFR ■ CFRの挙動をDNNを使って近似 ■ CFRにおける2つの値をDNNで近似 – Regret 𝑅) J (𝐼, 𝑎)を𝑉(𝐼, 𝑎|𝜃))で近似 – 平均戦略 l𝜎O(𝐼)をΠ(𝐼, 𝑎|𝜃p)で近似 ■ External Sampling MCCFRのアルゴリズムにDNNを組み込んだイメージ
  • 23. External Sampling MCCFRとの差分 ■ アルゴリズム概要 1. 戦略を更新する対象のプレイヤ (traverser) を𝑝とする 2. ゲーム⽊を探索 a. これまでのiterationのregretから,戦略𝜎Jを計算 b. プレイヤが𝑝であれば,各⾏動𝑎に対してregret ̃𝑟 𝐼, 𝑎 を計 算 c. それ以外のプレイヤであれば,⾏動を𝜎J(𝐼)からサンプル 3. Step 1とStep 2を各プレイヤを対象にして繰り返す 4. これまでのiterationの戦略𝜎Jから,平均戦略 l𝜎O(𝐼)を計算 DNNで近似︕︕
  • 24. 𝑉(𝐼, 𝑎|𝜃))の学習 ■ Regret ̃𝑟) J 𝐼, 𝑎 を近似 ■ 戦略𝜎J(𝐼)を決定するために使⽤ – 𝑉の出⼒に対するRegret matchingで決定
  • 25. 𝑉(𝐼, 𝑎|𝜃))の学習 ■ Loss function 𝐸 -,J=, ̃rL= ~tu,< [𝑡? P M ̃𝑟J= 𝑎 − 𝑉 𝐼, 𝑎 𝜃) A ] – 𝑀y,) : プレイヤ𝑝のregretを保存するメモリ ■ capacityをオーバーしたらreservoir samplingで上書き – Iteration毎に重みをランダム初期化して⼀から学習 ※これまでのiterationのregret ̃𝑟J= (𝐼, 𝑎)の平均はtotal regret z𝑅J(𝐼, 𝑎)に⽐ 例→古いデータも残しておく必要があるので,reservoir sampling
  • 26. Π(𝐼, 𝑎|𝜃p)の学習 ■ 平均戦略 l𝜎O(𝐼)を近似 ※CFRでは平均戦略がナッシュ均衡戦略へと収束 ■ Loss function 𝐸 -,J=,4L= ~t{ [𝑡? P M 𝜎J= 𝑎 − Π 𝐼, 𝑎 𝜃p A ] – 𝑀p : 各プレイヤの戦略を保存するメモリ ■ capacityをオーバーしたらreservoir samplingで上書き
  • 27. Algorithm Deep Counterfactual Regret Minimization Algorithm 1 Deep Counterfactual Regret Minimization function DEEPCFR Initialize each player’s advantage network V (I, a|✓p) with parameters ✓p so that it returns 0 for all inputs. Initialize reservoir-sampled advantage memories MV,1, MV,2 and strategy memory M⇧. for CFR iteration t = 1 to T do for each player p do for traversal k = 1 to K do TRAVERSE(;, p, ✓1, ✓2, MV,p, M⇧) . Collect data from a game traversal with external sampling Train ✓p from scratch on loss L(✓p) = E(I,t0,˜rt0 )⇠MV,p  t0 P a ⇣ ˜rt0 (a) V (I, a|✓p) ⌘2 Train ✓⇧ on loss L(✓⇧) = E(I,t0, t0 )⇠M⇧  t0 P a ⇣ t0 (a) ⇧(I, a|✓⇧) ⌘2 return ✓⇧ Algorithm 2 CFR Traversal with External Sampling function TRAVERSE(h, p, ✓1, ✓2, MV , M⇧, t) Input: History h, traverser player p, regret network parameters ✓ for each player, advantage memory MV for player p, strategy memory M⇧, CFR iteration t.
  • 28. Algorithm Deep Counterfactual Regret Minimization Algorithm 1 Deep Counterfactual Regret Minimization function DEEPCFR Initialize each player’s advantage network V (I, a|✓p) with parameters ✓p so that it returns 0 for all inputs. Initialize reservoir-sampled advantage memories MV,1, MV,2 and strategy memory M⇧. for CFR iteration t = 1 to T do for each player p do for traversal k = 1 to K do TRAVERSE(;, p, ✓1, ✓2, MV,p, M⇧) . Collect data from a game traversal with external sampling Train ✓p from scratch on loss L(✓p) = E(I,t0,˜rt0 )⇠MV,p  t0 P a ⇣ ˜rt0 (a) V (I, a|✓p) ⌘2 Train ✓⇧ on loss L(✓⇧) = E(I,t0, t0 )⇠M⇧  t0 P a ⇣ t0 (a) ⇧(I, a|✓⇧) ⌘2 return ✓⇧ Algorithm 2 CFR Traversal with External Sampling function TRAVERSE(h, p, ✓1, ✓2, MV , M⇧, t) Input: History h, traverser player p, regret network parameters ✓ for each player, advantage memory MV for player p, strategy memory M⇧, CFR iteration t. ゲーム⽊を探索してデータ収集
  • 29. Deep Counterfactual Regret Minimization Algorithm 1 Deep Counterfactual Regret Minimization function DEEPCFR Initialize each player’s advantage network V (I, a|✓p) with parameters ✓p so that it returns 0 for all inputs. Initialize reservoir-sampled advantage memories MV,1, MV,2 and strategy memory M⇧. for CFR iteration t = 1 to T do for each player p do for traversal k = 1 to K do TRAVERSE(;, p, ✓1, ✓2, MV,p, M⇧) . Collect data from a game traversal with external sampling Train ✓p from scratch on loss L(✓p) = E(I,t0,˜rt0 )⇠MV,p  t0 P a ⇣ ˜rt0 (a) V (I, a|✓p) ⌘2 Train ✓⇧ on loss L(✓⇧) = E(I,t0, t0 )⇠M⇧  t0 P a ⇣ t0 (a) ⇧(I, a|✓⇧) ⌘2 return ✓⇧ Algorithm 2 CFR Traversal with External Sampling function TRAVERSE(h, p, ✓1, ✓2, MV , M⇧, t) Input: History h, traverser player p, regret network parameters ✓ for each player, advantage memory MV for player p, strategy memory M⇧, CFR iteration t. Algorithm 𝑉(𝐼, 𝑎|𝜃))の学習
  • 30. Deep Counterfactual Regret Minimization Algorithm 1 Deep Counterfactual Regret Minimization function DEEPCFR Initialize each player’s advantage network V (I, a|✓p) with parameters ✓p so that it returns 0 for all inputs. Initialize reservoir-sampled advantage memories MV,1, MV,2 and strategy memory M⇧. for CFR iteration t = 1 to T do for each player p do for traversal k = 1 to K do TRAVERSE(;, p, ✓1, ✓2, MV,p, M⇧) . Collect data from a game traversal with external sampling Train ✓p from scratch on loss L(✓p) = E(I,t0,˜rt0 )⇠MV,p  t0 P a ⇣ ˜rt0 (a) V (I, a|✓p) ⌘2 Train ✓⇧ on loss L(✓⇧) = E(I,t0, t0 )⇠M⇧  t0 P a ⇣ t0 (a) ⇧(I, a|✓⇧) ⌘2 return ✓⇧ Algorithm 2 CFR Traversal with External Sampling function TRAVERSE(h, p, ✓1, ✓2, MV , M⇧, t) Input: History h, traverser player p, regret network parameters ✓ for each player, advantage memory MV for player p, strategy memory M⇧, CFR iteration t. Algorithm Π(𝐼, 𝑎|𝜃p)の学習
  • 31. TRAVERSE(;, p, ✓1, ✓2, MV,p, M⇧) . Collect data from a game traversal with external sampling Train ✓p from scratch on loss L(✓p) = E(I,t0,˜rt0 )⇠MV,p  t0 P a ⇣ ˜rt0 (a) V (I, a|✓p) ⌘2 Train ✓⇧ on loss L(✓⇧) = E(I,t0, t0 )⇠M⇧  t0 P a ⇣ t0 (a) ⇧(I, a|✓⇧) ⌘2 return ✓⇧ Algorithm 2 CFR Traversal with External Sampling function TRAVERSE(h, p, ✓1, ✓2, MV , M⇧, t) Input: History h, traverser player p, regret network parameters ✓ for each player, advantage memory MV for player p, strategy memory M⇧, CFR iteration t. if h is terminal then return the payoff to player p else if h is a chance node then a ⇠ (h) return TRAVERSE(h · a, p, ✓1, ✓2, MV , M⇧, t) else if P(h) = p then . If it’s the traverser’s turn to act Compute strategy t (I) from predicted advantages V (I(h), a|✓p) using regret matching. for a 2 A(h) do v(a) TRAVERSE(h · a, p, ✓1, ✓2, MV , M⇧, t) . Traverse each action for a 2 A(h) do ˜r(I, a) v(a) P a02A(h) (I, a0 ) · v(a0 ) . Compute advantages Insert the infoset and its action advantages (I, t, ˜rt (I)) into the advantage memory MV else . If it’s the opponent’s turn to act Compute strategy t (I) from predicted advantages V (I(h), a|✓3 p) using regret matching. Insert the infoset and its action probabilities (I, t, t (I)) into the strategy memory M⇧ Sample an action a from the probability distribution t (I). return TRAVERSE(h · a, p, ✓1, ✓2, MV , M⇧, t) Algorithm
  • 32. Algorithm TRAVERSE(;, p, ✓1, ✓2, MV,p, M⇧) . Collect data from a game traversal with external sampling Train ✓p from scratch on loss L(✓p) = E(I,t0,˜rt0 )⇠MV,p  t0 P a ⇣ ˜rt0 (a) V (I, a|✓p) ⌘2 Train ✓⇧ on loss L(✓⇧) = E(I,t0, t0 )⇠M⇧  t0 P a ⇣ t0 (a) ⇧(I, a|✓⇧) ⌘2 return ✓⇧ Algorithm 2 CFR Traversal with External Sampling function TRAVERSE(h, p, ✓1, ✓2, MV , M⇧, t) Input: History h, traverser player p, regret network parameters ✓ for each player, advantage memory MV for player p, strategy memory M⇧, CFR iteration t. if h is terminal then return the payoff to player p else if h is a chance node then a ⇠ (h) return TRAVERSE(h · a, p, ✓1, ✓2, MV , M⇧, t) else if P(h) = p then . If it’s the traverser’s turn to act Compute strategy t (I) from predicted advantages V (I(h), a|✓p) using regret matching. for a 2 A(h) do v(a) TRAVERSE(h · a, p, ✓1, ✓2, MV , M⇧, t) . Traverse each action for a 2 A(h) do ˜r(I, a) v(a) P a02A(h) (I, a0 ) · v(a0 ) . Compute advantages Insert the infoset and its action advantages (I, t, ˜rt (I)) into the advantage memory MV else . If it’s the opponent’s turn to act Compute strategy t (I) from predicted advantages V (I(h), a|✓3 p) using regret matching. Insert the infoset and its action probabilities (I, t, t (I)) into the strategy memory M⇧ Sample an action a from the probability distribution t (I). return TRAVERSE(h · a, p, ✓1, ✓2, MV , M⇧, t) 𝑉(𝐼, 𝑎|𝜃))から戦略𝜎J(𝐼)を計算
  • 33. Highlighted step: compute the regrets (advantages) r̃(I, a) and collect them into the advantage memory M_V. (Algorithm listing repeated from slide 31.)
  • 34. Highlighted step: compute the acting player's strategy σ^t(I) and collect it into the strategy memory M_Π. (Algorithm listing repeated from slide 31.)
  • 36. Proposition
■ Let T be the number of Deep CFR iterations, |A| the maximum number of actions at any information set, and K the number of game-tree traversals per iteration. If the memory M_{V,p} is sufficiently large, then with probability 1 − ρ the regret R_p^T satisfies
  R_p^T ≤ (1 + √2 / √(ρK)) · Δ |I_p| √|A| · √T + 4T |I_p| √(|A| Δ ε_L)
  (here Δ is the range of payoffs and |I_p| the number of player p's information sets; L_V^t is the MSE loss of V(I, a|θ_p^t), L_V*^t is the minimum achievable MSE loss, and ε_L satisfies L_V^t − L_V*^t ≤ ε_L)
→ As T → ∞, the average regret R_p^T / T is bounded by 4 |I_p| √(|A| Δ ε_L)
  • 39. Setting
■ Heads-up flop hold'em (FHP)
  – About 10^12 nodes and 10^9 information sets
  – A version of heads-up limit Texas hold'em that ends after the first two betting rounds
  – Only three community cards (the flop) are dealt
■ Heads-up limit Texas hold'em (HULH)
  – About 10^17 nodes and 10^14 information sets
■ Baselines
  – Neural Fictitious Self Play (NFSP)
  – Abstraction + CFR
    ■ Abstractions of varying coarseness
    ■ Lossless: an abstraction that merges strategically identical situations
  • 40. Network Architecture
■ The same architecture is used for both V and Π
■ Card ranks and suits are embedded and fed as input
[Figure 1. The neural network architecture used for Deep CFR. The network takes an infoset (observed cards and bet history) as input and outputs values (advantages or probability logits) for each possible action.]
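A rough PyTorch sketch of the kind of network Figure 1 describes, combining card embeddings (rank, suit, and card identity) with bet-history features and outputting one value per action; every size and input encoding here is an illustrative assumption, not the paper's exact architecture:

import torch
import torch.nn as nn

class DeepCFRNet(nn.Module):
    # n_cards: number of observed card slots (e.g. 2 hole + up to 5 board cards)
    # n_bet_features: length of the flattened bet-history encoding (assumed)
    def __init__(self, n_cards=7, n_bet_features=20, n_actions=4, dim=64):
        super().__init__()
        self.rank = nn.Embedding(13, dim)  # rank embedding (1-13)
        self.suit = nn.Embedding(4, dim)   # suit embedding (1-4)
        self.card = nn.Embedding(52, dim)  # full card embedding (1-52)
        self.card_fc = nn.Linear(dim, dim)
        self.bet_fc = nn.Linear(n_bet_features, dim)
        self.trunk = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(),
            nn.Linear(dim, dim), nn.ReLU(),
        )
        self.head = nn.Linear(dim, n_actions)  # advantages (V) or probability logits (Pi)

    def forward(self, cards, bets):
        # cards: (B, n_cards) integer card ids in [0, 52); bets: (B, n_bet_features) floats
        ranks, suits = cards // 4, cards % 4
        card_emb = self.rank(ranks) + self.suit(suits) + self.card(cards)  # (B, n_cards, dim)
        card_feat = torch.relu(self.card_fc(card_emb.sum(dim=1)))          # pool over card slots
        bet_feat = torch.relu(self.bet_fc(bets))
        return self.head(self.trunk(torch.cat([card_feat, bet_feat], dim=1)))

The same class can be instantiated twice, once for V(I, a|θ_p) and once for Π(I, a|θ_Π), matching the slide's point that both use the same architecture.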
  • 41. Results : FHP
[Figure 2. Comparison of Deep CFR with domain-specific tabular abstractions and NFSP in FHP. Coarser abstractions converge faster but are more exploitable. Deep CFR converges with 2-3 orders of …]
■ Deep CFR converges faster than NFSP
■ Because Deep CFR spends more of its execution time on SGD steps, it is not dramatically faster than NFSP in terms of total wall-clock time
  • 42. Results : FHP
■ Reducing the number of game-tree traversals per iteration slows convergence
■ The number of SGD steps per iteration affects the exploitability value the algorithm converges to
■ Increasing the number of units in the hidden layers reduces exploitability
[Figure 3. Left: FHP convergence for different numbers of training data collection traversals per simulated LCFR iteration. The dotted line shows the performance of vanilla tabular Linear CFR without abstraction or sampling. Middle: FHP convergence using different numbers of minibatch SGD updates to train the advantage model at each LCFR iteration. Right: Exploitability of Deep CFR in FHP for different model sizes. Label indicates the dimension (number of features) in each hidden layer of the model.]
  • 43. Results : FHP
[Figure 4. Ablations of Deep CFR components in FHP. Left: As a baseline, we plot 5 replicates of Deep CFR, which show low variance in exploitability curves (standard deviation at t = 450 is 2.25 mbb/g). Deep CFR without linear weighting converges to a similar exploitability, but more slowly. If the same network is fine-tuned at each CFR iteration rather than trained from scratch, exploitability is about 50% higher. Also, if the algorithm plays a uniform strategy when all regrets are negative (i.e. standard regret matching), rather than the highest-regret action, the final exploitability is also 50% higher. Right: If Deep CFR is run with sliding-window memories instead of reservoir sampling …]
■ Fine-tuning V at each iteration instead of retraining it from scratch worsens exploitability
■ Switching the memories from reservoir sampling to a sliding window makes exploitability unstable
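The memory ablation above contrasts reservoir sampling with a sliding window. A minimal reservoir buffer (the standard "Algorithm R" scheme, written as an illustration rather than the authors' implementation) looks like this; every sample ever inserted ends up in the memory with equal probability, no matter how late in training it was produced:

import random

class ReservoirMemory:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = []     # stored samples, e.g. (infoset, t, advantages) tuples
        self.n_seen = 0    # total number of samples ever offered to the buffer

    def add(self, sample):
        self.n_seen += 1
        if len(self.data) < self.capacity:
            self.data.append(sample)
        else:
            # Keep the new sample with probability capacity / n_seen
            j = random.randrange(self.n_seen)
            if j < self.capacity:
                self.data[j] = sample

    def sample(self, batch_size):
        return random.sample(self.data, min(batch_size, len(self.data)))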
  • 44. Results : HULH
■ Deep CFR beats NFSP by a large margin
■ Against finer (less coarse) abstractions, Deep CFR loses slightly

  Opponent:      NFSP        Deep CFR    Abstraction 3.3·10^6   3.3·10^7   3.3·10^8
  NFSP           -           −43 ± 2     −40 ± 2                −49 ± 2    −55 ± 2
  Deep CFR       +43 ± 2     -           +6 ± 2                 −6 ± 2     −11 ± 2
  (values in mbb/g; positive means the row model wins)

Table 1. Head-to-head expected value of NFSP and Deep CFR in HULH against converged CFR equilibria with varying abstraction sizes. For comparison, in 2007 an AI using abstractions of roughly 3·10^8 buckets defeated human professionals by about 52 mbb/g (after variance reduction techniques were applied).
  • 45. Summary and Impressions
■ Proposed Deep CFR, an algorithm that incorporates DNNs into CFR
■ Eliminates the need for abstraction when tackling huge games
■ Removing the need for abstraction makes CFR considerably more general-purpose
■ I would like to implement it and experiment with it on a simple game