1. Deep Q-Network for beginners
Etsuji Nakai
Cloud Solutions Architect at Google
2016/08/01 ver1.0
2. $ who am i
▪Etsuji Nakai
Cloud Solutions Architect at Google
Twitter @enakai00
3. Typical application of DQN
▪ Learning “the optimal operations to achieve the best score” from the screen
images of video games.
– In theory, you can learn it (without knowing the rules of the game) by collecting
data consisting of “on what screen, with which operation, how the score will
change, and what will be the next screen.”
– This is analogous to constructing an algorithm for the game of Go by collecting
data consisting of “with what board position, where you put the next stone, how
your advantage will change.”
https://www.youtube.com/watch?v=r3pb-ZDEKVg
https://www.youtube.com/watch?v=V1eYniJ0Rnk
4. Theoretical framework of DQN
▪ Suppose that you have all the quartets (s, a, r, s') for any pair (s, a), meaning “with
the current state s and the action a, you will have a reward (score) r and the next
state will be s' ”
– This corresponds to the data “on what screen, with which operation, how the score will
change, and what will be the next screen.”
– It is impractical to collect data for all possible pairs (s, a), but suppose that you have
enough of them to train the model to a certain level.
– Note that, mathematically, r and s' are functions of the pair (s, a).
▪ You may naively think that the following gives the optimal action given the current
state s.
⇒ Choose the action a which maximizes the immediate reward r.
– But this doesn’t necessarily lead to the best overall result. In the case of Breakout, you’d
better hit the blocks near the side walls first, even though it may take a little longer.
▪ In a nutshell, you have to figure out the action a which maximizes the long term
rewards.
5. Let’s imagine the magical “Q” function
▪ First, we define the total rewards as below.
    R = r_1 + γ r_2 + γ^2 r_3 + … = Σ_n γ^(n-1) r_n

– s_n and a_n represent the state and action at the n-th step. γ is a small number
around 0.9, introduced to prevent the sum from becoming infinite.
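The discounted sum can be sketched in a few lines of Python (the discount value 0.9 and the reward lists below are illustrative only):

```python
def total_reward(rewards, gamma=0.9):
    """Discounted sum of rewards: r_1 + gamma*r_2 + gamma^2*r_3 + ..."""
    total = 0.0
    for n, r in enumerate(rewards):
        total += (gamma ** n) * r
    return total

# With gamma < 1, even a very long stream of constant rewards stays finite:
# a constant reward of 1 converges to 1 / (1 - gamma) = 10.
print(total_reward([1.0] * 1000))  # close to 10.0
```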
▪ Now suppose that we have a convenient magical function Q(s, a) as below although
we don’t know how to calculate it at all.
– Q(s, a) = “The total rewards you will receive when you choose the next action a,
and keep choosing the optimal actions afterwards.”
▪ Once you have the function Q(s, a) , you can choose the optimal action at state s
with the following rule.
⇒ Choose the action a which maximizes Q(s, a), that is, the total rewards you get if you
keep choosing the optimal actions afterwards.
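This decision rule is just an argmax over the Q-values. A minimal sketch, with purely hypothetical states, actions, and Q-values:

```python
def best_action(Q, state, actions):
    """The rule above: pick the action with the highest Q-value at this state."""
    return max(actions, key=lambda a: Q[(state, a)])

# Hypothetical Q-values, for illustration only
Q = {("s0", "left"): 1.2, ("s0", "right"): 3.4, ("s0", "stay"): 0.5}
print(best_action(Q, "s0", ["left", "right", "stay"]))  # right
```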
6. The black magic of “recursive definition”
▪ Although we are not sure how we could calculate Q(s, a), we can say that it satisfies
the following “Q-equation”.

    Q(s, a) = r(s, a) + γ max_{a'} Q(s'(s, a), a')

– Here r(s, a) and s'(s, a) are the reward and the next state determined by the pair (s, a).
– See the next slide for the mathematical proof.
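Assuming the Q-equation takes the standard form Q(s, a) = r + γ · max over a' of Q(s', a'), its right-hand side can be spot-checked on a toy example (the states, action names, and values below are all made up for illustration):

```python
GAMMA = 0.9

def q_rhs(Q, r, s_next, actions):
    """Right-hand side of the Q-equation: r + gamma * max over a' of Q(s', a')."""
    return r + GAMMA * max(Q[(s_next, a)] for a in actions)

# Toy check: from s0, the only action yields reward 1 and leads to the
# terminal state s1 (where all future rewards are 0).  Consistent Q-values
# then make both sides of the equation match exactly.
Q = {("s0", "go"): 1.0, ("s1", "go"): 0.0}
print(q_rhs(Q, r=1.0, s_next="s1", actions=["go"]))  # 1.0, equals Q[("s0", "go")]
```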
7. Proof of the Q-equation
– Suppose that the chain of states and actions is as below when you start from the state s_0
and keep choosing the best actions (receiving the rewards r_1, r_2, r_3, …).

    s_0 → a_0 → s_1 → a_1 → s_2 → a_2 → …

– From the definition of Q(s, a), the following equation holds.

    Q(s_0, a_0) = r_1 + γ r_2 + γ^2 r_3 + …  ―― (1)

– Now suppose that the initial state is s_1 instead of s_0, and you keep choosing the best
actions. The chain will be the same one above, starting from s_1.
– Again, from the definition of Q(s, a), the following equation holds.

    Q(s_1, a_1) = r_2 + γ r_3 + γ^2 r_4 + …  ―― (2)

– Rearrange (1) as below, and substitute (2).

    Q(s_0, a_0) = r_1 + γ (r_2 + γ r_3 + …) = r_1 + γ Q(s_1, a_1)

– Considering the following relations, this is equivalent to the Q-equation.

    r_1 = r(s_0, a_0),  s_1 = s'(s_0, a_0),
    Q(s_1, a_1) = max_{a'} Q(s_1, a')   (since a_1 is the optimal action at s_1)
8. Approximate “Q” function using Q-equation
▪ Prepare some function with adjustable parameters, and by adjusting them, you may
find a function which satisfies the Q-equation.
– If you succeeded, now you have the “Q” function!
– Strictly speaking, the Q-equation is a necessary condition, not a sufficient one. However,
under some assumptions, it has been proved to be sufficient.
▪ Here are the steps to adjust the parameters.
– 1. Let D be the set of all the quartets (s, a, r, s') you have.
– 2. Prepare an initial candidate of the “Q” function as Q(s, a | w).
– 3. Calculate the following error function E(w) using all (or a part of) the data in D. This is
the sum of squared differences between the LHS and RHS of the Q-equation.

    E(w) = Σ_{(s, a, r, s') ∈ D} { Q(s, a | w) − ( r + γ max_{a'} Q(s', a' | w) ) }²

– 4. Adjust the parameters w so that E(w) becomes smaller. Then go back to 3.
▪ After repeating 3. and 4., if E(w) becomes small enough, you have an approximate
version of the “Q” function.
– The more complicated (expressive) the candidate Q(s, a | w), the better the
approximation you can expect.
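The steps above can be sketched as a toy fitting loop. Here the candidate “Q” is just a lookup table of adjustable numbers, and the dataset D is a made-up 3-state chain; a real DQN would use a neural network and gradient descent instead:

```python
GAMMA = 0.9
ACTIONS = [0, 1]

# Step 1: a hypothetical dataset D of quartets (s, a, r, s') from a tiny chain
# (states 0, 1, 2; state 2 is terminal; action 1 from state 1 pays reward 1).
D = [(0, 1, 0.0, 1), (0, 0, 0.0, 0), (1, 1, 1.0, 2), (1, 0, 0.0, 0)]

# Step 2: the candidate "Q" function -- simply a table of parameters w[(s, a)].
w = {(s, a): 0.0 for s in range(3) for a in ACTIONS}

def rhs(r, s_next):
    """RHS of the Q-equation for one quartet."""
    return r + GAMMA * max(w[(s_next, a)] for a in ACTIONS)

# Steps 3 and 4, repeated: shrink the squared difference between the two
# sides of the Q-equation by nudging each parameter toward its target.
for _ in range(200):
    for s, a, r, s_next in D:
        w[(s, a)] += 0.1 * (rhs(r, s_next) - w[(s, a)])

print(round(w[(1, 1)], 2), round(w[(0, 1)], 2))  # ~1.0 and ~0.9
```

After enough sweeps the table settles where both sides of the Q-equation agree on every quartet: the state-1 action that pays 1 is worth 1.0, and reaching it one step earlier is worth γ × 1.0 = 0.9.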
9. What is “the more complicated function”?
▪ Yes!
The Deep Neural Network!
▪ “Deep Q-Network” is essentially a multi-layer neural network used as the candidate
of Q-function.
10. By the way, what is Neural Network?
▪ Roughly speaking, it’s just a combination of multiple simple functions resulting in a
highly complex function.
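A minimal sketch of that idea: each layer is a simple function (weighted sums followed by a nonlinearity), and stacking layers composes them into a more complex function. All weights below are arbitrary illustrative values:

```python
import math

def layer(xs, weights, biases):
    """One simple building block: weighted sums followed by a nonlinearity."""
    return [math.tanh(sum(w * x for w, x in zip(ws, xs)) + b)
            for ws, b in zip(weights, biases)]

def tiny_network(x):
    """Composing two simple layers already gives a nonlinear function of x."""
    h = layer([x], weights=[[1.0], [-1.0]], biases=[0.0, 0.5])   # hidden layer
    return sum(w * v for w, v in zip([0.7, -0.3], h))            # output layer

print(tiny_network(0.2))
```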
11. How do you collect the quartets?
▪ How do you collect the quartets (s, a, r, s') in the real world?
– Basically, just keep playing the game with random actions.
– In theory, if you keep playing for infinite time, you would encounter all the possible states.
▪ But in reality, the probability of reaching some states with random actions is quite
small. To compensate for this, you can take the following strategy.
– Once you have collected some amount of data, train the Q-function using these data.
– After that, you play the game by mixing random actions and the (presumably) best actions
calculated from the current Q-function.
– When you have collected some more additional data, train the Q-function again.
– Through this cycle, you can make the Q-function better and collect more data, including
states which are unreachable with only random actions.
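This mixing strategy is commonly implemented as epsilon-greedy action selection. A minimal sketch, with a made-up epsilon and hypothetical Q-values:

```python
import random

random.seed(0)  # fixed seed so the demo below is reproducible
EPSILON = 0.1   # fraction of random moves; the value here is an assumption

def choose_action(Q, state, actions, epsilon=EPSILON):
    """Mix random exploration with the (presumably) best action from the current Q."""
    if random.random() < epsilon:
        return random.choice(actions)                     # random action
    return max(actions, key=lambda a: Q[(state, a)])      # best action from Q

# Hypothetical Q-values, for illustration only
Q = {("s", "left"): 0.2, ("s", "right"): 1.5}
picks = [choose_action(Q, "s", ["left", "right"]) for _ in range(1000)]
print(picks.count("right") > 900)  # mostly the best action, with some exploration
```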
▪ Why not play using only the Q-function, without any random actions?
– It doesn’t work. Only by collecting all kinds of states, even through random actions,
can the model learn “how to gain long-term rewards by passing through some
non-rewarding states.”