2. Overview
• I implemented an Othello AI (IaGo) inspired by the AlphaGo algorithm
• AlphaGo is composed of 3 parts:
• SL policy network: predict next action
• Value network: evaluate board state
• MCTS: choose action using 2 networks
3. Background
Game    | Search space | AI                | Year
Othello | 10^60        | NEC Logistello    | 1997
Go      | 10^360       | DeepMind AlphaGo  | 2016
• Go has an extremely huge search space: 10^360
• c.f. the estimated number of all atoms existing in the universe: 10^80
• Before AlphaGo, it had been thought to take 10 more years for Go AIs to beat human professionals due to this huge search space
• Since I don’t have enough machine resources for replicating AlphaGo, I made an Othello version
4. Dataset
(Board state, place of next stone) pairs: 6 million -> 48 million
• Data were from online Othello game records
• 6 million sets of board state & the place of the next stone
• Augmented 8 times using rotation & transposition symmetry (see the sketch below)
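For illustration, a minimal sketch of the 8-fold augmentation (4 rotations, each with and without transposition), assuming the board state and the place of the next stone are stored as 8x8 numpy planes; this is not the author's exact preprocessing code.

import numpy as np

def augment_8x(board_plane, move_plane):
    # board_plane: 8x8 array encoding the board state
    # move_plane : 8x8 one-hot array marking the place of the next stone
    # Returns the 8 symmetric variants (rotation & transposition symmetry).
    variants = []
    for k in range(4):
        b, m = np.rot90(board_plane, k), np.rot90(move_plane, k)
        variants.append((b, m))        # rotation only
        variants.append((b.T, m.T))    # rotation + transposition
    return variants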
5. SL policy network (classification)
• Input: 2-channel matrix of the board state
• Output: probability distribution over the next move
• Network: 9 convolution layers with a softmax output layer (see the sketch below)
• 57% prediction accuracy
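A minimal sketch of such a network, assuming 8x8 boards, 2 input channels, and a 64-way softmax over board squares; the channel width (64) and the use of PyTorch are illustrative choices, not the author's exact implementation.

import torch
import torch.nn as nn

class SLPolicyNet(nn.Module):
    # Input : (batch, 2, 8, 8) board planes (own stones / opponent stones)
    # Output: (batch, 64) probability distribution over next-move squares
    def __init__(self, channels=64):
        super().__init__()
        layers = [nn.Conv2d(2, channels, 3, padding=1), nn.ReLU()]
        for _ in range(7):
            layers += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        layers.append(nn.Conv2d(channels, 1, 1))   # 9th convolution: collapse to one plane
        self.conv = nn.Sequential(*layers)

    def forward(self, x):
        logits = self.conv(x).flatten(1)           # (batch, 64) logits
        return torch.softmax(logits, dim=1)        # softmax output layer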
6. RL policy network
• Polished SL policy with policy gradients
-> Reinforcement Learning policy network
• After training, generated teacher data for
value network
• Played games between RL policy networks
-> 1.25 million sets of board state and result
• Augmented by 8 times -> 10 million
SL policy network vs. SL policy network (opponent)
WIN -> encourage its plays, LOSE -> discourage its plays
(32 × 400 = 12,800 times; update rule sketched below)
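The win/lose feedback can be written as a REINFORCE-style policy-gradient update. The sketch below assumes the SLPolicyNet above and one finished self-play game; it is a generic illustration, not the author's training code.

import torch

def reinforce_update(policy, optimizer, states, moves, result):
    # states : (T, 2, 8, 8) board planes seen by the learning network during one game
    # moves  : (T,) long tensor of move indices (0..63) it actually played
    # result : +1 if it won, -1 if it lost (0 for a draw)
    probs = policy(states)                                            # (T, 64)
    log_p = torch.log(probs.gather(1, moves[:, None]).squeeze(1) + 1e-8)
    loss = -(result * log_p).sum()   # WIN -> encourage its plays, LOSE -> discourage them
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()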
7. Value network (regression)
• Input: 2-channel matrix of the board state
• Output: value of the board state
(Win: +1, Lose: -1, Draw: 0)
• Network: 9 convolution layers (similar to the SL policy network; see the sketch below)
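A minimal sketch of a matching value network, reusing a 9-convolution body like the SL policy net but ending in a scalar regression head squashed to [-1, 1]; hidden sizes are illustrative, and the training target is the game result with a mean-squared-error loss.

import torch.nn as nn

class ValueNet(nn.Module):
    # Input : (batch, 2, 8, 8) board planes
    # Output: (batch, 1) value of the board state in [-1, 1]
    def __init__(self, channels=64):
        super().__init__()
        body = [nn.Conv2d(2, channels, 3, padding=1), nn.ReLU()]
        for _ in range(8):
            body += [nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        self.conv = nn.Sequential(*body)                    # 9 convolution layers in total
        self.head = nn.Sequential(nn.Flatten(),
                                  nn.Linear(channels * 8 * 8, 1),
                                  nn.Tanh())                # squash to [-1, 1]

    def forward(self, x):
        return self.head(self.conv(x))

# Trained by regression: target = game result (+1 win / -1 lose / 0 draw), MSE loss.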
Prediction examples
8. Monte Carlo tree search
• Rollout policy: a simplified SL policy network that runs much faster
• MCTS: searches deeper for a good path (sketched below, after the steps)
1. Make a child node using the SL policy network
2. Evaluate the current node with the value network and the result of a rollout-policy self-play
3. Update the ancestor nodes’ values
4. Choose the most visited node
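The four steps can be put together in a minimal MCTS sketch. Here `policy`, `value_net`, `rollout`, and `play` are assumed helper functions (not IaGo's actual code), the PUCT-style selection rule is a common choice rather than the author's stated formula, and sign flipping between the two players is omitted for brevity.

import math

class Node:
    def __init__(self, state, prior=1.0):
        self.state, self.prior = state, prior
        self.children = {}                     # move -> Node
        self.visits, self.value_sum = 0, 0.0

    def mean_value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def mcts_choose(root_state, policy, value_net, rollout, play,
                n_simulations=400, c_puct=1.0, mix=0.5):
    # policy(state)    -> {move: prior probability} from the SL policy network
    # value_net(state) -> float in [-1, 1]
    # rollout(state)   -> result of a fast rollout-policy self-play
    # play(state, move)-> next board state
    root = Node(root_state)
    for _ in range(n_simulations):
        node, path = root, [root]
        # Selection: walk down the tree with a PUCT-style rule
        while node.children:
            move, child = max(node.children.items(),
                              key=lambda kv: kv[1].mean_value()
                              + c_puct * kv[1].prior
                              * math.sqrt(node.visits) / (1 + kv[1].visits))
            node = child
            path.append(node)
        # 1. Make child nodes from SL-policy priors
        for move, p in policy(node.state).items():
            node.children[move] = Node(play(node.state, move), prior=p)
        # 2. Evaluate: blend the value network with a rollout-policy self-play result
        leaf_value = mix * value_net(node.state) + (1 - mix) * rollout(node.state)
        # 3. Update ancestor nodes' values
        for n in path:
            n.visits += 1
            n.value_sum += leaf_value
    # 4. Choose the most visited child of the root
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]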
9. Results
• IaGo (complete) beat simple SL policy in
approx. 90% of games!
• Still, there is room for improvement…
• It takes too long time for calculation
• IaGo seems to have a weak point
• Teacher data were from games
between amateurs
• Objective/quantitative evaluation is
needed
• Graphical User Interface
-> Upload to web!
10. Summary
• IaGo is composed of 3 parts:
• SL policy network: predict next action
• Value network: evaluate board state
• MCTS: choose action using 2 networks
• IaGo became a good player through training
Editor's Notes
Thank you Mr. Bayne. Good afternoon!
Recently I learned about AlphaGo, an AI for playing the game of Go, and implemented its algorithm in an Othello version.
So, let me tell you how I made it and how it works.
AlphaGo is composed of these 3 parts:
First, the policy network, which predicts the next action.
Second, the value network, which evaluates the board state.
And third, Monte Carlo tree search, which chooses an action using the two networks.
So, I’ll now explain them a little in detail.
First of all, let me mention that Go has an extremely huge search space of 10 to the 360th power.
I guess it's hard to imagine, so I'll give you one example.
The estimated number of all atoms existing in the universe is 10 to the 80th power.
Again, the search space of Go is 10 to the 360th power, so it's far, far bigger than the number of all atoms in the universe.
Because of this huge search space, before AlphaGo it had been thought that it would take 10 more years for Go AIs to beat human professionals.
Imagine what a big achievement AlphaGo made!
But since I don't have enough machine resources for replicating AlphaGo, I made an Othello version.
The search space of Othello is just 10 to the 60th power.
I’ve now told you about the background.
I’ll move on to the dataset I used for training IaGo.
Data were from online Othello game records that you can get for free on the internet.
They include 6 million sets of board state and the place of the next stone.
Then I augmented them by 8 times using rotation and transposition symmetry.
So finally, I got 48 million sets of board state and the place of the next stone.
The first part of IaGo: the Supervised Learning (SL) policy network.
It took a 2-channel matrix of the board state as input, and output a probability distribution over the next choice, the next action.
The network was 9 layers of convolution with a softmax output layer.
After training, it predicted human plays with an accuracy of 57%.
Next, I polished the SL policy network with the policy gradient algorithm.
The polished network is called the reinforcement learning policy network, or RL policy network for short.
In the process of reinforcement learning, 2 SL policy networks played games against each other.
Parameters of the network were updated so that good actions were encouraged and bad actions were discouraged, according to the result of the game.
I repeated this more than 12,000 times.
After training, the RL policy network generated teacher data for the value network.
2 RL policy networks played games against each other.
Then I got 1.25 million sets of board state and result.
Again I augmented them by 8 times, so finally I got 10 million sets of board state and result.
Next I'll talk about the value network.
This network is very similar to the SL policy network in terms of structure.
What’s the difference?
While the SL policy network is for classification of the next action, the value network is for regression of the game result.
The value network gets a 2-channel matrix of the board state and outputs the value of the board state.
I defined the value of the board state as +1 for a win, -1 for a loss, and 0 for a draw.
So the value means the likelihood of winning for the white player.
Look at the example pictures.
For the left one, the white player is almost winning, so the value is 0.67, roughly equal to +1.
For the one in the center, the white player is almost losing, so the value is nearly equal to -1.
And for the right one, you'll never know the result, so the value is around 0.
Let's move on to the final part of the algorithm, Monte Carlo tree search.
First, I made a rollout policy. This is a simplified SL policy network.
Its prediction accuracy was lower than the SL policy network's, but it worked much faster.
In MCTS, I have to run many, many simulations, so I need a predictor that works fast.
MCTS, in short, is an algorithm that searches deeper for a good path in the game tree using self-play simulation.
And it’s composed of four steps.
Step 1, make a child node using the SL policy network.
Step 2, evaluate the current node with the value network and the result of a rollout-policy self-play.
Step 3, update the ancestor nodes’ values according to the rollout-policy self-play.
Step 4, choose the most visited node.
I’ve told you about the algorithm of IaGo, so I’ll now talk about its performance.
IaGo played some games against the simple SL policy network and won approximately 90% of them.
Still, there is room for improvement.
First, it takes too long to calculate.
If I can make it shorter, then IaGo can run more simulations and will become stronger.
Second, IaGo seems to have a weak point. The picture on the right side was taken when I beat the complete version of IaGo: I took all of its stones, and the game ended partway through.
I'm not sure about its cause, but I guess one reason is that the teacher data were from games between amateur players, not professionals.
Third, I couldn't really evaluate IaGo’s performance in an objective or quantitative way, so a more appropriate evaluation is needed.
And finally, I’d like to develop a sophisticated graphical user interface and upload it to the web so that everyone can play IaGo easily just by clicking.
Let me summarize my presentation.
I’ve explained IaGo’s algorithm and its performance.
IaGo is composed of three parts.
The SL policy network, which predicts the next action.
The value network, which evaluates the board state.
And Monte Carlo tree search, which chooses an action using these two networks.
And IaGo became a good player through training on a huge dataset.
That's it for my presentation. Do you have any questions?