Using Deep Q-Learning with
Lunar Lander
CHRISTOPHER EICHER | HANS SAUMER | AAKASH CHOTRANI
Contents
The Problem
    Lunar Lander
    Q-Learning
        Table vs. neural network
    Neural Networks
        Biological Neurons
        Artificial Neurons
        Deep Networks
        Our Network
The Tools
    TensorFlow
        Steps to work with TensorFlow
        TensorFlow vocabulary
        GPU vs. CPU
        Verify script
        Building the network
        Session
        Predict
        Train
        The Network
    PyCharm
    Anaconda
The Process
    Cartpole in Keras
    Code architecture
    Customizable layers
    Batching
    Optimizers
        Gradient descent
        Adam optimizer
    Graphing Log Files
        Setting window title
    Selecting actions
        Greedy
        E-Greedy
        Softmax
    TensorBoard
    Other sources
    Experience Replay
        Train many times, low rate
        Replay Bank as an Array
        Adjusting sample size
    Training the Network
        Training After Episode
        Training Every Step
    Loss functions
        One-hot Alternative
        The difference
    Ending Early
    The Final Approach
Missed Opportunities
    Having a well-established vocabulary
    Strict adherence to Q-Learning
    Test automation
    Target networks
    TensorFlow and the GPU
    Jupyter Notebooks
    Too slow
    Fine tuning parameters
        Replay Bank size
        Learning rate and other Optimizer parameters
        Discount rate
        Network size
    Better explore policy management
    Configs
    Reward or gradient clipping, Huber loss functions
    Preload with human trials
Conclusions
    Experience Replay was Key
    Preloading experiences and e-greedy
    Neural nets were not very stable
    Needed a large rolling data set of experiences
    What we did well
The Problem
Lunar Lander
The aim of this project is to solve the Lunar Lander challenge using reinforcement learning. Our approach uses a
deep neural network constructed with TensorFlow and a method of reinforcement learning called Q-Learning;
when the two are combined, the result is known as a DQN (Deep Q-Network).
OpenAI Gym is an open-source platform which provides environments for reinforcement learning research,
ranging from classic problems such as Cart-Pole and Lunar Lander to Atari games like Breakout and Pac-Man.
In the Lunar Lander problem, the agent receives observations (or states) and a reward for each action it takes.
The observation contains the information the agent can see: x-position, y-position, x-velocity, y-velocity,
lander angle, angular velocity, right-leg grounded, and left-leg grounded. This makes the state space continuous
and 8-dimensional.
The discrete action space is: do nothing, fire the main (bottom) engine, fire the left engine, and fire the right
engine.
The agent receives a reward at each step, which consists of taking an action and getting a reward and new
observation. A successful landing gives the agent 100 points, while a crash landing or flying off the side gives
it -100 points. Each leg that contacts the ground nets the agent 10 points, and firing the main thruster
subtracts 0.3 points per frame from the reward.
Our goal is to make the ship land softly on the ground using reinforcement learning. The problem is considered
solved when the lander finishes with an average score greater than 200 over 100 consecutive episodes. An
episode begins in a pseudo-random state and ends when the lander lands successfully, crashes, flies out of
bounds, or reaches 1000 steps.
Q-Learning
Q-learning is a reinforcement learning technique. It's used to find an optimal action-selection policy. This policy
is used to make decisions about what action to take in a given state. The agent learns a function that predicts
the reward of taking an action in a given state, so that it can take the optimal action for any state. This function is
a modified form of the Bellman equation.
From Wikipedia:
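$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]$$

where $\alpha$ is the learning rate, $\gamma$ is the discount rate, $r_t$ is the reward received for taking action $a_t$ in state $s_t$, and $s_{t+1}$ is the resulting state.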
Table vs. neural network
The simplest way to think about Q-Learning when you have a discrete action space and a discrete observation
space is to create a table of expected rewards. With this table, a given action, and a state, you can look up
the expected reward for taking that action in that state. This simple implementation doesn't scale well and
doesn't work with continuous states or actions without breaking the continuous space up into an approximate
discrete space, which isn't practical in most cases. Instead, we use a neural network to approximate this function.
Neural Networks
We read about solutions that solved this challenge using deep neural networks. We chose to approach the
problem using deep neural networks made with TensorFlow.
Biological Neurons
The structure of a biological neuron is shown in the figure to the right. Neurons are the basic working units of
the brain and nervous system; they transmit information to other nerve cells, muscles, etc. The dendrites are
covered with synapses which receive messages from other neurons. The messages are electrical impulses
which are transmitted through the axon. There are roughly 100 billion neurons in the brain. Based on the
strength of the incoming impulses, some neurons fire and some do not.
Artificial Neurons
The artificial neuron was inspired by
biological neurons (However biologists will
happily point out that they really don’t model
how biological neurons work) the inputs are
multiplied by weights and summed up with a
bias. The summed values arethen passed
through an activation function like rectified
linear (ReLU), sigmoid, etc. Ifthey pass the
threshold, then the output of the neuron is
fired and fed into another neuron or output
layer.
𝑦 = 𝑓(Σ𝑥𝑖 ∗ 𝑊𝑖 + 𝑏)
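As a rough illustration in plain NumPy (not part of our project code), a single ReLU neuron computes:

import numpy as np

def neuron(x, w, b):
    """Multiply the inputs by the weights, add the bias, then apply a ReLU activation."""
    z = np.dot(x, w) + b
    return max(z, 0.0)   # the neuron only fires if the weighted sum exceeds zero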
Deep Networks
A deep network is simply a neural network that doesn't connect its input directly to the output layer. The fully-
connected layers in between are known as the hidden layers. When hidden layers are used with Q-Learning, it
is referred to as Deep Q-Learning.
Our Network
The Tools
TensorFlow
There are many libraries that are available to use neural networks such as Theano, Keras, Torch, and
TensorFlow. We chose TensorFlow because it is widely used, well documented, and there are many tutorials
available for getting started.
TensorFlow is an open-source software library for numerical computation using data flow graphs. The edges in
the graph are tensors, or specialized multi-dimensional arrays, such as the network's weights, and the nodes
represent operations, such as multiplying the outputs of the previous layer by those weights.
Steps to work with TensorFlow:
1. Build Neural Net: Define initial weights, biases, the number of neurons, layers, the shape of inputs and
outputs.
2. Train Neural Net: Provide a series of input values and try to reduce loss.
3. Predict: After the neural network is trained, it’s used to make predictions based on new input.
TensorFlow vocabulary:
Tensor: A central unit of data in TensorFlow. It consists of an array of any number of dimensions. A tensor has
a rank which represents the number of dimensions. The shape describes the sizes of those dimensions.
Example                  Classic Description   Rank   Shape
3                        Scalar                0      []
[1,2,3]                  Vector                1      [3]
[[1,2,3],[4,5,6]]        Matrix                2      [2,3]
[[[1,2,3]],[[7,8,9]]]    3-D Tensor            3      [2,1,3]
Placeholder: It’s createdto accept external inputs into the graph. A placeholder is a promise to provide value
later.
Constant: The value of a constant is provided at the time of initialization.
Session: To evaluate nodes we must run the computation graph within a session.
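A minimal sketch of how these pieces fit together, assuming the TensorFlow 1.x API we used (the tensor names here are illustrative):

import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 8])   # a promise to provide the 8-D state later
w = tf.constant([[1.0]] * 8)                      # value fixed at initialization time
y = tf.matmul(x, w)                               # a node in the computation graph

with tf.Session() as sess:                        # nodes are only evaluated inside a session
    result = sess.run(y, feed_dict={x: [[0.0] * 8]})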
GPU vs. CPU
TensorFlow can run operations on the GPU instead of the CPU which can improve performance for large
neural networks.
Verify script
Installation of the GPU version can be difficult because there is a lot that can go wrong. Luckily someone
went through the trouble of creating a script that would diagnose any installation problems.
https://gist.github.com/mrry/ee5dbcfdd045fa48a27d56664411d41c
Building the network
We used a feed-forward network, which means that information flows one way through the network. We create
variables, and then we specify an operation (like matrix multiplication) by calling a function where we pass in
tensors as arguments and the function returns a tensor with the output of that operation. We can then take
that output and use it as input for the next operation.
Session
To begin the process of training and using the network for predictions we create a new session and initialize
all the variables. Once that is done, we can feed it input and get back the results.
Predict
Once we have a session, we can pass it the state of our world, which is an 8-dimensional vector, and
it will output the expected reward for each action the agent could take in the form of a 4-dimensional vector.
Train
To train the network we pass it the input we want to train on with the addition of the desired output. There
will be a measurable difference between the output of the network and the desired output of the network;
this is referred to as the error or loss. The optimizer will train the network by adjusting the weights and biases
in the network to minimize the error.
The Network
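A minimal TensorFlow 1.x sketch of a network shaped like ours (8 inputs, hidden layers of 50 and 40 ReLU neurons, 4 outputs, mean squared loss, and the Adam optimizer); the variable names and learning rate are illustrative, not our exact code:

import tensorflow as tf

state_in = tf.placeholder(tf.float32, [None, 8])   # 8-dimensional observation
q_target = tf.placeholder(tf.float32, [None, 4])   # desired expected rewards for the 4 actions

hidden1 = tf.layers.dense(state_in, 50, activation=tf.nn.relu)
hidden2 = tf.layers.dense(hidden1, 40, activation=tf.nn.relu)
q_out   = tf.layers.dense(hidden2, 4)              # expected reward for each action

loss     = tf.reduce_sum(tf.square(q_target - q_out))
train_op = tf.train.AdamOptimizer(learning_rate=0.001).minimize(loss)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # Predict: feed a state, get back 4 expected rewards
    q_values = sess.run(q_out, feed_dict={state_in: [[0.0] * 8]})
    # Train: feed states along with the desired outputs
    sess.run(train_op, feed_dict={state_in: [[0.0] * 8], q_target: q_values})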
PyCharm
We used PyCharm for most of the project. It has very good git integration, a great merge tool, good syntax
highlighting and code completion, and an excellent interactive debugger. PyCharm also highlights code that
doesn't conform to PEP 8, the Python style guide, which was a great way to keep the project code style
consistent.
Anaconda
There are many free Python libraries, and they often extend or use other Python libraries. This creates a lot of
dependencies that need to be managed. TensorFlow requires a specific 64-bit version of Python and relies on
NumPy. In our project, we also use Matplotlib, which also depends on NumPy. We found that Anaconda is a
great way to abstract away those problems, allowing us to have multiple versions of Python and Python
libraries on our machines. It downloads all the libraries we need and allows us to easily switch between
custom-tailored Python environments.
We originally intended to work on an Atari game, but we had difficulties getting the dependencies set up. There
were also some interesting walking simulators, but those required a MuJoCo license. We got the Box2D-based
environments working, so we went with Lunar Lander.
The Process
We initially started with the TensorFlow getting-started documentation and the MNIST example, which is
essentially the "Hello World" of machine learning with TensorFlow. Given images of handwritten digits, the goal
is to classify the images by what digit is represented in the image. Our first attempt at making a neural network
was based on the network used to solve the MNIST classification problem.
Cartpole in Keras
Some of our initial work was to solve another OpenAI challenge called cart-pole because it was smaller in scope.
It was solved using Keras instead of TensorFlow.
In the cart-pole challenge, we used a version of experience replay; the agent played 1000 games taking random
actions, and the episodes that happened to win were saved and trained on. This method was tried on Lunar
Lander with poor results.
Theoretically, this model-free reinforcement learning algorithm could be used with other environments
because we are not providing explicit rules about how the world works; we simply provide states and
rewards to the agent. However, this didn't solve the Lunar Lander problem, likely because the number of inputs
and possible actions increased, making it a more difficult problem to solve.
Code architecture
The main functionality of the Agent occurred in 3 functions: reset, step, and end. The reset function was
called at the start of every episode. It would reset the environment and a few episode-specific variables. The
step function is where the agent would choose its action and step the environment forward. Depending on
the implementation, the agent was either trained here or in the end function. The end function is called at the
end of each episode. This would be used to output some debug info and sometimes was used to train on a
batch of previous episodes.
A configuration class was created to store parameters for use with the agent class. This allowed us to configure
the parameters of our Agent in a simple way. We intended to save these configurations to file for possible
replaying or test automation but that was never implemented.
To improve our workflow, we had multiple main files within the project (this is trivial to do in Python). This
allowed us to test and edit different implementations with minimal conflicts between each other. Due to
Python's ability to run any individual file as main, this was very easy. We were also able to test whether changes
we made broke the other's version of the project. This way we could make major changes without stepping on
each other's code when it was unnecessary. In one of the main files, we made it easy to switch between the
CPU and GPU versions of TensorFlow, since the GPU version could be difficult to use in some situations.
Customizable layers
One of the first custom features we added to our project was the ability to easily change the size and depth of
our hidden layers. No other project we found allowed for this customization, but it was valuable for us to be
able to try new neural network shapes without having to modify the source code that we shared. The sizes of
the hidden layers were passed in as a parameter, making them easily customizable.
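A minimal sketch of the idea, assuming TensorFlow 1.x (the parameter name hidden_sizes is illustrative, not our exact code):

import tensorflow as tf

def build_network(state_in, hidden_sizes=(50, 40), num_actions=4):
    """Stack fully-connected ReLU layers of the given sizes, then add a linear output layer."""
    layer = state_in
    for size in hidden_sizes:                      # depth and width come from a single parameter
        layer = tf.layers.dense(layer, size, activation=tf.nn.relu)
    return tf.layers.dense(layer, num_actions)     # one expected reward per action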
Batching
There are some operations that we didn't want to happen every single episode. Rendering the environment
every episode significantly affects how fast our experiments run, so we don't render every episode. When
logging, we also don't write to file every episode. We had a batch_size parameter, and after completing
a batch_size number of episodes we would render an entire episode and do any file I/O we needed. This
was also a good time for us to print any debug info we wanted.
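For example, a sketch of that check (the function and variable names are illustrative):

def end_of_batch(episode, batch_size, recent_scores, log_file):
    """Do the expensive per-batch work: flush file I/O, print debug info, and signal a render."""
    if episode % batch_size != 0:
        return False                     # skip rendering and I/O most of the time
    log_file.flush()                     # batch up file writes instead of writing every episode
    print("episode", episode, "average score", sum(recent_scores) / len(recent_scores))
    return True                          # the caller renders the next episode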
Optimizers
Gradient descent
In the very first attempt at a custom network, we tried to use gradient descent but the output would diverge
to infinity so we had to stop using it. We switched optimizers and revisited this one later, and we figured out
that our learning rate was too high, but we didn’t see any advantage to switching back to gradient descent so
it was ultimately not used in the final version.
Adam optimizer
The Adam optimizer is similar to the gradient descent optimizer, except that it adapts the step size for each
parameter using running averages of the gradients and of their squares, along with a few other minor tweaks.
This optimizer is more computationally expensive, but we chose to use it because most of the material we were
referencing used the Adam optimizer for this and related problems.
Graphing Log Files
Visualizing our results with a live graph has been informative. It allows us to cut experiments short if they have
a low score and extreme variance, which saves us a lot of time.
Keeping records of previous experiments is valuable. We originally overwrote log files so that we could rerun
an experiment after discovering the last run had something wrong with it, tweak it, and restart it. This turned
out to be a terrible idea and made it easy to lose log files. You could also corrupt a log file by running two
experiments at the same time with the same name. We started adding timestamps to the names of the
experiments so that each one was unique, which was much more practical. We often delete the logs of runs
that crashed or were manually terminated early, since there was usually not a lot of valuable information
there. For other experiments, we simply archived them in a subfolder when we were done with them. We
found it helpful to increase the number of available line colors and styles so that more runs could be shown on
screen without the information on the graph becoming diluted and unreadable. We also made sure to highlight
runs that were in progress (detected from the file modification timestamp), to draw those on top of other
lines, and to order the legend by most recent log file creation. This doesn't happen automatically; we had to
reverse the ordering of the legend to achieve this.
Setting window title
We also dove into OpenAI's source code so that we could change the window title of the environment, allowing
us to identify which experiment we were looking at. We are considering making it less hacky and finding out
how hard it is to contribute to the open source project. This basic feature greatly improved our quality of life
when running multiple experiments.
Selecting actions
Greedy
One of the important factors in a reinforcement learning system is the policy that determines how the agent
takes an action. To choose an action, we run the neural net by feeding it the current state and getting the
output. The output is an array of 4 values that represents the expected future reward for taking each action.
The greedy policy simply means taking the action with the highest expected reward.
E-Greedy
Another approach is to use e-greedy. This policy defines an epsilon where a random action is taken with a
probability of epsilon and the best action is taken with a probability of 1 - epsilon. An epsilon value that is too
small will result in the agent converging to a local minimum as opposed to the global minimum, so a relatively
high epsilon value is used at first and reduced over time.
This approach allows the agent to “explore” its choices to discover more optimal solutions.
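A minimal sketch of e-greedy selection in NumPy (names are illustrative):

import numpy as np

def e_greedy(q_values, epsilon):
    """With probability epsilon take a random action, otherwise take the highest-valued one."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore
    return int(np.argmax(q_values))               # exploit (greedy)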
Softmax
The main disadvantage of an e-greedy approach is that even if two actions are nearly equal candidates for the
optimal action, only the single highest value is considered. The second action, even if it is only slightly less
optimal, will only be chosen with a probability of 0.25 * epsilon. This makes it a lot harder for the agent to
explore other potentially optimal solutions.
An e-greedy policy utilizing argmax does not allow the agent to know that there may be a second action that is
nearly as optimal as the most optimal action; it either chooses a random action or the most optimal one.
By implementing a softmax function, the agent chooses an action based on its weight compared to the
others. We also implemented a temperature variable that allows the probabilities that softmax returns to be
skewed closer together (all values approaching 0.25) or farther apart (a single action nearing 1.0).
A high temperature introduces more randomness as it pushes the weights closer together. A low
temperature causes the policy to converge to the best action. The probability returned from the softmax
function is the exponential of the reward divided by temperature, divided by the sum of the exponentials of the
rewards divided by temperature, shown below:
# note: reward values are clamped between -500 and 500 to prevent overflow (np is NumPy)
clipped = np.clip(rewards, -500, 500)
probability = np.exp(clipped / temp) / np.sum(np.exp(clipped / temp))
Softmax returned the probability that an action is an optimal action. Thus, an action was selected at random
using these probabilities.
Problems that arise from using this policy include values converging to 0 or infinity. As the rewards get too
large or small, the exponential function in softmax overflows. The reward values must be clamped in order to
prevent this from happening, which loses some of the information stored in the actual values. For example,
when all of the rewards become less than -500, the probability distribution results in equal chances for every
action. This happens even if three of the rewards are -1000 and one is -500. Clearly, the action that results in
-500 is the best action, but due to the clamping that prevents overflow errors, the agent does not end up
favoring the optimal action.
TensorBoard
TensorBoard is an application that is used to visualize TensorFlow graphs. Unfortunately, we didn't get a lot of
use out of it. It got us to use name scopes to help organize the graph and see it visually. When we got it working,
the graph looked more complicated than we anticipated and didn't serve us well for showing the organization
of our network. We also intended to use TensorBoard to see the values of the weights the graph was using,
possibly live, but we didn't see the value in continuing this line of investigation.
Other sources
As we discovered more resources and alternative solutions, we found ourselves picking features and
techniques to integrate into our own project and looking for ways to optimize what we knew (replay bank,
controlling how much data is passed to TensorFlow, batching our TensorFlow operations, avoiding garbage
collection).
Our first resource was okay, but it didn't explain its approach very well and the code is not well written:
https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-
with-tables-and-neural-networks-d195264329d0
This website has some good articles on the theory and implementation on experience replay, but its source
code is a bit cryptic in places:
https://jaromiru.com/2016/09/27/lets-make-a-dqn-theory/
This is someone's capstone project write-up. It's an easy read for learning about his process, but he uses a
different library than we do. His source is cryptic, but he gave us a good idea of what size network to use
(hidden layers of 50 and 40):
https://github.com/dennisfrancis/LunarLander-v2/blob/master/report.pdf
This person's source code was well written, so we made some important observations from reading through
it, mainly that you can train on many samples every step. The implementation was flawed, however: it used
the alternative one-hot version of calculating loss and had to use a massive network (hidden layers of 256,
256, and 500) to solve the problem. Our solution used a significantly smaller network:
https://github.com/Seraphli/YADQN/blob/master/code/openai/LunarLander-v2/Experiment_5/evaluation.py
Experience Replay
Experience replay is when, instead of training on every new observation you make, you store your observation
(or experiences) into a bank, and train on them at a later time. There are several advantages to this approach.
It is more efficient to train on a batch of observations, rather than training on each one individually. When
training every step, you don’t have to train on just one observation, you can train on hundreds. Neural
networks, even large ones, don’t have an infinite capacity to learn, every time it optimizes for some state it
may do so at the expense of other states. Ideally, it only optimizes out tendencies that had a negative impact
on performance. However, if you were to train on every step of an episode sequentially, the neural network
will develop a bias for its most recent experiences and as such “forgets” the experiences it had farther back.
By saving experiences later we can sample them at random, eliminating bias on most recent experiences. This
allows the neural network to more easily converge on a more generalized strategy that isn’t biased or overfit,
to its most recent experiences. A good strategy to use with experience replay is to preload the bank with
experiences gathered by taking random actions. This allows you train on more samples from the very
beginning. Before we were preloading with experiences, that cap of the sample size we could train with was
limited to the number of experiences accrued, this limited the amount of learning that could be done in the
earlier episodes.
Train many times, low rate
We think that one of the biggest advantages experience replay afforded us is that we could now train many
times per step. We could learn more from each step we took by saving it to the replay pool. In this way, a single
step can be trained on many times but mixed in with other states so that we do not overfit to those states.
Replay Bank as an Array
The resources that we found typically used a list or a deque for storing experiences. Using a NumPy array
structure for the experience replay bank seems to bring performance improvements, likely because it
eliminates a lot of garbage collection by Python.
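A minimal sketch of a fixed-size replay bank backed by NumPy arrays (the class name and capacity are illustrative, not our exact code):

import numpy as np

class ReplayBank:
    def __init__(self, capacity=100000, state_dim=8):
        self.states      = np.zeros((capacity, state_dim), dtype=np.float32)
        self.actions     = np.zeros(capacity, dtype=np.int32)
        self.rewards     = np.zeros(capacity, dtype=np.float32)
        self.next_states = np.zeros((capacity, state_dim), dtype=np.float32)
        self.dones       = np.zeros(capacity, dtype=np.bool_)
        self.capacity, self.index, self.count = capacity, 0, 0

    def add(self, state, action, reward, next_state, done):
        i = self.index
        self.states[i], self.actions[i], self.rewards[i] = state, action, reward
        self.next_states[i], self.dones[i] = next_state, done
        self.index = (self.index + 1) % self.capacity     # overwrite the oldest experiences
        self.count = min(self.count + 1, self.capacity)

    def sample(self, batch_size):
        idx = np.random.randint(0, self.count, size=batch_size)   # uniform random sample
        return (self.states[idx], self.actions[idx], self.rewards[idx],
                self.next_states[idx], self.dones[idx])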
Adjusting sample size
Since we found that agents can solve the problem when using experience replay, we decided to implement it
and then adjust some of the policies that we were using. The main aspect that we adjusted was the use of the
softmax action selection policy instead of an e-greedy approach. After changing this, we ran several tests to
see how this method combined with experience replay when using different batch sizes.
The first tests we ran had batch sizes of 1, 10, 100, and 1000. A batch size of 1 was subpar and did not even
come close to solving the problem. A batch size of 10 had results similar to those of the previous tests before
implementing experience replay. We had to stop the batch-size-1000 test because it would have taken far too
long to finish and it did not appear to be converging to a solution. We found that all the tests resulted in an
agent that would slow down near the landing pad, but it would not stop on the landing pad and thus would
not be considered in the "landed" state. We also saw that taking no action was considered the best action, but
softmax was preventing the agent from taking it consistently, as that action only had a probability of around
40% of being selected.
Since the previous tests showed that the agent was not converging fast enough with the current parameters of
the softmax function, we decided to run some tests adjusting the temperature parameter. The neural network
was preloaded with a few thousand random actions, so we decided to lower the temperature so it is more likely
to select the best action. It would have a few hundred episodes of training on the random data before it would
start training on the new data. We found that with too low a temperature, the probabilities of taking each
action converge on 25%. This is because we had to clamp the values between -500 and 500, which means that
after many losses all the rewards would be clamped to the same value, which was not intended. We had to
end the tests while we figured out a solution to this.
Training the Network
Training After Episode
One strategy was to wait until the end of an episode to do any training. When the agent trained once every
step, it did not use any of the future rewards. This makes the agent very short-sighted, so it may not find a long-
term optimal strategy. In order to factor in the long-term reward, we tried waiting until the end of the episode
to begin training the neural net. During the episode, every action, state, and reward was saved to be processed
later. Starting from the last step of the episode, the reward of each step is adjusted by adding the reward from
the next step multiplied by the discount rate. This is represented in the following code:
# work backwards; the reward from the final step is left unchanged
for i in range(len(rewards) - 2, -1, -1):
    rewards[i] += discount_rate * rewards[i + 1]
Updating the rewards in this manner allowed the agent to see that certain actions may lead to certain end
states. Considering that the final reward was usually either -100 or 100, this greatly affected the rewards and
would push them towards a certain result (good or bad). Training at the end of the episode on every step from
that episode also proved to be more efficient.
The method of training the network at the time was to train every step, and we had an updated method of
calculating the reward used for training. The network would evaluate the next state in order to get the
expected reward of moving to that state, using the following formula:
new_reward = raw_reward + discount_rate * expected_reward(next_state)
The expected reward of the next state depends on what action selection policy is being used. Using softmax,
the expected reward is calculated using the softmax function. Using argmax alone, the expected reward is
simply the maximum reward of the next state.
We decided that since the previous method of training at the end was performing somewhat decently, we
would try to do this new method of calculating reward at the end of the episode. This would allow us to start
at the end of the episode and update the rewards backwards, thus ensuring that the new information is used
when calculating expected reward.
The agent did not improve and, in fact, got tremendously worse. We think it ended up weighting future rewards
too heavily and would only take actions that were beneficial in the long run. This turned out to be taking no
action, so it would not try correcting itself until it was too late. We then reverted to how it was before,
where the agent had only a small amount of knowledge about future actions.
Training Every Step
In the end we did switch back to training during every step, but instead of training on that step’s observation,
we took a large sample from the experience replay bank and trained on that.
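A minimal sketch of that per-step update, reusing the illustrative ReplayBank and TensorFlow tensors sketched earlier (the sample size and discount rate shown are placeholders, not our tuned values):

import numpy as np

def train_on_replay(sess, q_out, train_op, state_in, q_target, bank,
                    sample_size=100, discount_rate=0.99):
    """One update: sample past experiences and fit Q-values toward r + discount * max Q(next state)."""
    states, actions, rewards, next_states, dones = bank.sample(sample_size)

    q_now  = sess.run(q_out, feed_dict={state_in: states})        # current predictions
    q_next = sess.run(q_out, feed_dict={state_in: next_states})   # predictions for the next states

    targets = q_now.copy()
    best_next = np.max(q_next, axis=1)
    # only the Q-value of the action actually taken is changed; terminal steps keep the raw reward
    targets[np.arange(sample_size), actions] = rewards + discount_rate * best_next * (~dones)

    sess.run(train_op, feed_dict={state_in: states, q_target: targets})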
Loss functions
Loss is the error between what the network produced and what we want it to produce. The optimizer discovers
the best values for the weights and biases to get the results we want. To do this, the optimizer finds the
derivatives of the loss with respect to the weights and biases and adjusts them to minimize the loss. We used
mean squared loss, which means the loss is the squared difference between the values returned by the network
and the target values we want the network to return.
There is also cross-entropy loss, but this works better with sigmoid activation functions. We are using ReLU, a
piecewise-linear activation function, so it's better to use a quadratic loss function like mean squared error.
http://neuralnetworksanddeeplearning.com/chap3.html
To calculate the loss, we get the expected rewards that the network produces, and then we adjust the expected
reward for the action that was taken using the Q-Learning algorithm. We then feed in this new set of expected
rewards (where only one reward was changed), and it gets subtracted from the original array of expected
rewards. The result is an array of zeros except for the element associated with the action that was taken,
which contains the difference between the old and new value. This array is then squared, and we take the sum
of all the values in the array (which is just the difference squared). This is the loss, and this value gets fed to
the optimizer.
One-hot Alternative
One of our references provided an alternative way to calculate the loss. It produces the same final loss value,
but produces different derivatives for the optimizer to follow. Instead of giving the network the entire
expected-rewards array as the target, we pass it a one-hot array representing the action, along with the
expected reward for that action. The original expected-rewards array is multiplied by the one-hot array to
isolate the expected reward of the action that was taken, and then we take the difference between the sum
of that array and the expected reward we passed in. This is then squared and passed to the optimizer.
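A minimal TensorFlow 1.x sketch of that alternative loss (the names are illustrative, and q_predicted stands in for the network's output layer):

import tensorflow as tf

q_predicted    = tf.placeholder(tf.float32, [None, 4])  # stands in for the network's output
action_one_hot = tf.placeholder(tf.float32, [None, 4])  # one-hot encoding of the action taken
target_reward  = tf.placeholder(tf.float32, [None])     # desired expected reward for that action

q_taken = tf.reduce_sum(q_predicted * action_one_hot, axis=1)  # zeroes out the other three actions
loss    = tf.reduce_sum(tf.square(target_reward - q_taken))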
We tried this with the version we had at the time, which wasn’t quite solving the challenge yet. It would very
quickly get many wins, but not enough to be considered solved. It would then fall off into a very sporadic state.
After switching the loss functions the agent didn’t approach a solution to the problem, so although it proved
to be more stable, we switched back to the original loss functions.
The difference
We believe that the one-hot version of calculating loss may have its applications, but it does not work well for
this project. We believe that the derivatives produced in this manner do not account for any error introduced
in the expected rewards for the other three actions; because those values are multiplied by zeros, any error
produced by changing the weights is hidden. The former method uses subtraction to isolate the difference
between the original expected rewards and the intended expected rewards, so when changing the weights
creates a difference in the expected rewards for the other three actions, this is reflected in the derivatives.
The latter method hides the introduced error because the three other expected rewards are multiplied by
zero, so any error caused by changing the weights is hidden from the derivatives used by the optimizer. We
consulted with Professor Bede, and he believes this reasoning is correct.
Ending Early
One of the things we did to save time was to track the average score of the previous 100 episodes so that we
could declare the experiment a success and end early. This shifted our focus from reaching a stable solution to
reaching a solution faster. We also began recording timestamps so that we could record how long the runs
were taking.
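A minimal sketch of that check (the 200-point average over 100 consecutive episodes comes from the problem definition):

from collections import deque

recent_scores = deque(maxlen=100)        # rolling window of the last 100 episode scores

def solved(score):
    recent_scores.append(score)
    return len(recent_scores) == 100 and sum(recent_scores) / 100 >= 200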
The Final Approach
Missed Opportunities
Having a well-established vocabulary
It took us a while to establish the shared vocabulary we used to describe the algorithms and data. We would
pick arbitrary names, but as we learned more we found more standard nomenclature. The worst case of this
was probably using the word "batch" to mean different things.
Strict adherence to Q-Learning
We started with only a rough idea of what Q-Learning looked like in practice, so we ended up with a few
versions that did not strictly adhere to the Q-Learning strategy.
Test automation
We would have liked to set up a way to predefine a series of experiments in the form of config files and have a
program run experiments in a few threads, but the experiments took so long that there wasn't a lot of benefit
to doing this.
Target networks
One of the techniques we wanted to use was target networks. The idea is you use two networks concurrently,
one that makes the decisions, and one that learns from those decisions. Periodically, the network that is being
trained gets its weights and biases copied over to the network that is making the decision. This process slows
the rate at which the network learns, and is supposed to make the network more stable.
TensorFlow and the GPU
We aren't sure that we saw gains from using the GPU version of TensorFlow. With more time we may have run
some tests to check. The networks we were using may have been small enough that just transferring data to
the GPU could have been a bottleneck. We have no way of knowing for now.
Jupyter Notebooks
Jupyter notebooks present a new paradigm for coding, particularly for code assignments in Python or R. We
haven’t done much with them but they seem to be a great way to write and document code, or present code
as assignments. If we had more time we think they could be worth looking into for the machine learning
curriculum. They seem well suited for academic environments.
Too slow
Running experiments took a lot of time. We couldn't do as much experimenting and tweaking as we wanted
because it would take hours to learn whether a small change to one of the parameters had any effect.
It would have been nice to have access to more powerful machines. From our research, we found that there
are multiple ways one could run these experiments in the cloud. AWS has servers with high-end graphics cards
for machine learning work. We found this web page: http://cs231n.github.io/gce-tutorial/
It goes into using Google cloud services and Jupyter notebooks to run experiments on high end servers.
Fine tuning parameters
There are some parameters that we could be exploring more:
Replay Bank size:
This affects how many times an experience gets trained on before it is "forgotten" by being overwritten
with a new experience.
Learning rate and other Optimizer parameters
One of the main problems the agent had is that it appears to stop learning after a certain point. We noticed
that the optimizer had a couple of other parameters that may affect how it approaches the solution. We wanted
to test whether lowering the optimizer's epsilon value would let it get closer to the solution. We also wanted
to run more tests with the learning rate, since these two values are closely linked.
It was evident from previous tests that learning rate played a large role in how the agent learned. We went
back to revisit some of these tests while also tuning the epsilon parameter. We thought that by changing the
lower bound of the learning rate the gradient descent might give more favorable results.
The results from these tests were inconclusive; they did not stand out from the rest of the data. More tests
would be needed to draw a conclusion on whether the value of epsilon in the optimizer has a large impact on
the agent's ability to converge to a solution.
Discount rate
The discount rate allows the agent to consider expected future rewards; it could be interesting to see if
decreasing this value would yield better or worse results.
Network size
This is one of the biggest opportunities to decrease the time it takes for the agent to converge on a solution.
We started with no hidden layers, then [8, 8] layers, but we ended with [50, 40] based on one of our resources
that solved this problem. Since we are doing things a little differently it would be interesting to see what kind
of results we could get when changing the size of the network. We might also change the learning rate as well
when changing the network.
Better explore policy management
One of the factors we spent the most time changing was the explore factor. We thought that if we could get the
explore factor right, we could find a solution. This didn't turn out to be true; once we implemented experience
replay with the preloading of random experiences, it mostly removed the need for exploration. We could have
made that process less painful by having a better way to switch between explore policies. The way we
structured our code, every switch between methods required a small refactor.
Configs
If our experiments didn't take so long to run, we could have managed our experiment configurations
better. This probably would have included saving and loading configs from JSON, and saving these configs inside
of the Agent class so that we didn’t have needless copying of values from the config object to the Agent object
every time we created an Agent.
Reward or gradient clipping, Huber loss functions
There are other ways to calculate rewards and loss functions that we didn't get to explore.
Preload with human trials
It would be interesting, and kind of fun, to have a human play as well and incorporate those experiences into
the replay bank.
Conclusions
Experience Replay was Key
When we first started, we thought that experience replay was just another tool to make our networks more
consistent and stable, but it turned out to be a very important factor that led to the agent finally being able to
complete the challenge.
Preloading experiences and e-greedy
In the e-greedy policy we used a diminishing epsilon to cause the agent to explore in the early episodes, to give
it a wide enough variety of experiences that it could find the best solution. When we implemented experience
replay and began preloading the bank of experiences with observations made from purely random actions, this
seemed to replace the need for e-greedy. All of the initial observations were made from random actions, and as
time went on, the bank was overwritten with experiences from taking the optimal choices. This gave the agent
a similar overall training regimen.
Neural nets were not very stable
Just because a neural network has converged on a good solution doesn’t mean it will stop changing. A neural
network may get worse over time. We think that this could be caused by the network training too much on
successful episodes: the network may start to optimize for runs with more ideal starting states and
“forget” how to handle the more difficult starting states. The network could slowly repurpose those neurons
that make it more adaptable to be neurons that optimize ideal runs until it reaches a critical threshold where
its decreased ability to recover from bad states begins to lead to it encountering more bad states in the future.
Needed a large rolling data set of experiences
When we first started saving experiences, we only kept them until the end of an episode and trained on them
then. We were surprised by the amount of training required; the network must train for hours, on millions of
samples, in order to learn to play the game well.
What we did well
The way we built our neural network made it easy to change the number of hidden layers, and the size of
those layers, without having to modify the source code. Our implementation is closest to this one:
https://github.com/Seraphli/YADQN/blob/master/code/openai/LunarLander-v2/Experiment_5/evaluation.py
But we calculated loss differently, so we could use a much smaller network for our solution. We also used a
much simpler e-greedy policy: we chose a small constant for epsilon instead of a piecewise function to decay
epsilon, since preloading random experiences seems to have an equivalent effect.
 
final waves properties grade 7 - third quarter
final waves properties grade 7 - third quarterfinal waves properties grade 7 - third quarter
final waves properties grade 7 - third quarterHanHyoKim
 
BACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika DasBACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika DasChayanika Das
 
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptxGENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptxRitchAndruAgustin
 

Recently uploaded (20)

Measures of Central Tendency.pptx for UG
Measures of Central Tendency.pptx for UGMeasures of Central Tendency.pptx for UG
Measures of Central Tendency.pptx for UG
 
CHROMATOGRAPHY PALLAVI RAWAT.pptx
CHROMATOGRAPHY  PALLAVI RAWAT.pptxCHROMATOGRAPHY  PALLAVI RAWAT.pptx
CHROMATOGRAPHY PALLAVI RAWAT.pptx
 
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdfKDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
KDIGO-2023-CKD-Guideline-Public-Review-Draft_5-July-2023.pdf
 
Q4-Mod-1c-Quiz-Projectile-333344444.pptx
Q4-Mod-1c-Quiz-Projectile-333344444.pptxQ4-Mod-1c-Quiz-Projectile-333344444.pptx
Q4-Mod-1c-Quiz-Projectile-333344444.pptx
 
Introduction Classification Of Alkaloids
Introduction Classification Of AlkaloidsIntroduction Classification Of Alkaloids
Introduction Classification Of Alkaloids
 
DNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptxDNA isolation molecular biology practical.pptx
DNA isolation molecular biology practical.pptx
 
complex analysis best book for solving questions.pdf
complex analysis best book for solving questions.pdfcomplex analysis best book for solving questions.pdf
complex analysis best book for solving questions.pdf
 
Probability.pptx, Types of Probability, UG
Probability.pptx, Types of Probability, UGProbability.pptx, Types of Probability, UG
Probability.pptx, Types of Probability, UG
 
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer Zahana
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer ZahanaEGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer Zahana
EGYPTIAN IMPRINT IN SPAIN Lecture by Dr Abeer Zahana
 
dll general biology week 1 - Copy.docx
dll general biology   week 1 - Copy.docxdll general biology   week 1 - Copy.docx
dll general biology week 1 - Copy.docx
 
whole genome sequencing new and its types including shortgun and clone by clone
whole genome sequencing new  and its types including shortgun and clone by clonewhole genome sequencing new  and its types including shortgun and clone by clone
whole genome sequencing new and its types including shortgun and clone by clone
 
Environmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptxEnvironmental Acoustics- Speech interference level, acoustics calibrator.pptx
Environmental Acoustics- Speech interference level, acoustics calibrator.pptx
 
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary Microbiology
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary MicrobiologyLAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary Microbiology
LAMP PCR.pptx by Dr. Chayanika Das, Ph.D, Veterinary Microbiology
 
Ultrastructure and functions of Chloroplast.pptx
Ultrastructure and functions of Chloroplast.pptxUltrastructure and functions of Chloroplast.pptx
Ultrastructure and functions of Chloroplast.pptx
 
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
6.1 Pests of Groundnut_Binomics_Identification_Dr.UPR
 
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
Observation of Gravitational Waves from the Coalescence of a 2.5–4.5 M⊙ Compa...
 
Oxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptxOxo-Acids of Halogens and their Salts.pptx
Oxo-Acids of Halogens and their Salts.pptx
 
final waves properties grade 7 - third quarter
final waves properties grade 7 - third quarterfinal waves properties grade 7 - third quarter
final waves properties grade 7 - third quarter
 
BACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika DasBACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
BACTERIAL SECRETION SYSTEM by Dr. Chayanika Das
 
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptxGENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
GENERAL PHYSICS 2 REFRACTION OF LIGHT SENIOR HIGH SCHOOL GENPHYS2.pptx
 

Using Deep Q-Learning and Experience Replay with Lunar Lander

  • 1. Using Deep Q-Learning with Lunar Lander CHRISTOPHER EICHER | HANS SAUMER | AAKASH CHOTRANI
  • 2. Contents
    The Problem ... 5
        Lunar Lander ... 5
        Q-Learning ... 5
            Table vs. neural network ... 6
        Neural Networks ... 6
            Biological Neurons ... 6
            Artificial Neurons ... 6
            Deep Networks ... 6
            Our Network ... 7
    The Tools ... 7
        Tensorflow ... 7
            Steps to work with TensorFlow: ... 7
            TensorFlow vocabulary: ... 7
            GPU vs. CPU ... 8
            Verify script ... 8
            Building the network ... 8
            Session ... 8
            Predict ... 8
            Train ... 8
            The Network ... 9
        PyCharm ... 10
        Anaconda ... 10
    The Process ... 10
        Cartpole in Keras ... 10
        Code architecture ... 10
        Customizable layers ... 11
        Batching ... 11
        Optimizers ... 11
            Gradient descent ... 11
            Adam optimizer ... 11
        Graphing Log Files ... 12
  • 3.
        Setting window title ... 12
        Selecting actions ... 12
            Greedy ... 12
            E-Greedy ... 12
            Softmax ... 13
        TensorBoard ... 13
        Other sources ... 13
        Experience Replay ... 14
            Train many times, low rate ... 14
            Replay Bank as an Array ... 14
            Adjusting sample size ... 15
        Training the Network ... 15
            Training After Episode ... 15
            Training Every Step ... 16
        Loss functions ... 16
            One-hot Alternative ... 16
            The difference ... 17
        Ending Early ... 17
        The Final Approach ... 18
    Missed Opportunities ... 19
        Having a well-established vocabulary ... 19
        Strict adherence to Q-Learning ... 19
        Test automation ... 19
        Target networks ... 19
        TensorFlow and the GPU ... 19
        Jupyter Notebooks ... 19
        Too slow ... 19
        Fine tuning parameters ... 20
            Replay Bank size: ... 20
            Learning rate and other Optimizer parameters ... 20
            Discount rate ... 20
            Network size ... 20
        Better explore policy management ... 20
  • 4.
        Configs ... 20
        Reward or gradient clipping, Huber loss functions ... 21
        Preload with human trials ... 21
    Conclusions ... 21
        Experience Replay was Key ... 21
        Preloading experiences and e-greedy ... 21
        Neural nets were not very stable ... 21
        Needed a large rolling data set of experiences ... 21
        What we did well ... 21
  • 5. The Problem Lunar Lander The aim of this project is to solve the Lunar Lander challenge using reinforcement learning. Our approach uses a deep neural network constructed with TensorFlow and a method of reinforcement learning called Q-Learning; when the two are combined, the result is known as a DQN (Deep Q-Network). OpenAI Gym is an open-source platform which provides environments ranging from classic problems such as Cart-Pole and Lunar Lander to Atari games like Breakout and Pac-Man for reinforcement learning research. In the Lunar Lander problem, the agent receives observations (or states) and a reward for each action it takes. The observation contains the information the agent can see: x-position, y-position, x-velocity, y-velocity, lander angle, angular velocity, right-leg grounded, and left-leg grounded. This makes the state space continuous and 8-dimensional. The discrete action space is: do nothing, fire the main (bottom) engine, fire the left engine, and fire the right engine. At each step the agent takes an action and receives a reward and a new observation. A successful landing gives the agent 100 points, while a crash landing or flying off the side gives it -100 points. Each leg that contacts the ground nets the agent 10 points, and firing the main thruster subtracts 0.3 points per frame from the reward. Our goal is to make the ship land softly on the ground using reinforcement learning. The problem is considered solved when the lander finishes with an average score greater than 200 over 100 consecutive episodes. An episode begins in a pseudo-random state and ends when the lander lands successfully, crashes, flies out of bounds, or reaches 1000 steps. Q-Learning Q-Learning is a reinforcement learning technique used to find an optimal action-selection policy. This policy is used to decide what action to take in a given state. The agent learns a function that predicts the reward of taking an action in a given state so that it can take the optimal action for any state. This function is a modified form of the Bellman equation. From Wikipedia:
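The equation image itself does not survive in this transcript. The update the text points to is the standard Q-learning rule (written here in LaTeX; alpha is the learning rate and gamma the discount rate):

    Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t, a_t) \right]

In words: nudge the current estimate of the expected reward for the action taken toward the observed reward plus the discounted best expected reward from the next state.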
  • 6. Table vs. neural network The simplest way to think about Q-Learning when you have a discrete action space and a discrete observation space is to create a table of expected rewards. With this table and a given action and state, you can look up the expected reward for taking that action in that state. This simple implementation doesn't scale well and doesn't work with continuous states and actions without breaking the continuous space up into an approximate discrete space, which isn't practical in most cases. We use a neural network to approximate this function instead. Neural Networks We read about solutions that solved this challenge using deep neural networks. We chose to approach the problem using deep neural networks built with TensorFlow. Biological Neurons The structure of a biological neuron is shown in the figure to the right. Neurons are the basic working units of the brain: cells within the nervous system that transmit information to other nerve cells, muscles, etc. The dendrites are covered with synapses which receive messages from other neurons. The messages are electrical impulses which are transmitted through the axon. There are about 100 billion neurons in the brain. Based on the strength of the incoming impulses, some neurons fire and some do not. Artificial Neurons The artificial neuron was inspired by biological neurons (however, biologists will happily point out that it doesn't really model how biological neurons work). The inputs are multiplied by weights and summed together with a bias. The summed value is then passed through an activation function such as rectified linear (ReLU) or sigmoid. If it passes the threshold, the output of the neuron fires and is fed into another neuron or the output layer. y = f(Σ xᵢ · Wᵢ + b) Deep Networks A deep network is simply a neural network that doesn't connect its input directly to the output layer. The fully-connected layers in between are known as the hidden layers. When hidden layers are used with Q-Learning, it is referred to as Deep Q-Learning.
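As a concrete illustration of the formula above, here is a minimal NumPy sketch of a single artificial neuron; the input, weight, and bias values are made up for the example:

    import numpy as np

    def relu(z):
        # Rectified linear activation: max(0, z) element-wise.
        return np.maximum(0.0, z)

    def neuron(x, w, b):
        # y = f(sum_i x_i * W_i + b), matching the formula above.
        return relu(np.dot(x, w) + b)

    # Illustrative values only.
    y = neuron(np.array([0.5, -1.0, 2.0]), np.array([0.1, 0.4, -0.2]), b=0.05)
    print(y)   # 0.5*0.1 + (-1.0)*0.4 + 2.0*(-0.2) + 0.05 = -0.7, ReLU gives 0.0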
  • 7. Our Network The Tools Tensorflow There are many libraries available for working with neural networks, such as Theano, Keras, Torch, and TensorFlow. We chose TensorFlow because it is widely used, well documented, and there are many tutorials available for getting started. TensorFlow is an open-source software library for numerical computation using data flow graphs. The edges in the graph are weights, which are stored as tensors (specialized matrices), and the nodes represent the results of operations, such as multiplying the previous layer by its weights. Steps to work with TensorFlow: 1. Build the neural net: define the initial weights, biases, number of neurons, layers, and the shape of inputs and outputs. 2. Train the neural net: provide a series of input values and try to reduce the loss. 3. Predict: after the neural network is trained, it is used to make predictions on new input. TensorFlow vocabulary: Tensor: the central unit of data in TensorFlow. It consists of an array of any number of dimensions. A tensor has a rank, which is the number of dimensions, and a shape, which describes the sizes of those dimensions.

    Example                  Classic description   Rank   Shape
    3                        Scalar                0      []
    [1,2,3]                  Vector                1      [3]
    [[1,2,3],[4,5,6]]        Matrix                2      [2,3]
    [[[1,2,3]],[[7,8,9]]]    -                     3      [2,1,3]

  • 8. Placeholder: created to accept external inputs into the graph. A placeholder is a promise to provide a value later. Constant: the value of a constant is provided at the time of initialization. Session: to evaluate nodes we must run the computation graph within a session. GPU vs. CPU TensorFlow can run operations on the GPU instead of the CPU, which can improve performance for large neural networks. Verify script Installation of the GPU version can be difficult because there is a lot that can go wrong. Luckily, someone went through the trouble of creating a script that diagnoses installation problems: https://gist.github.com/mrry/ee5dbcfdd045fa48a27d56664411d41c Building the network We used a feed-forward network, which means that information flows one way through the network. We create variables, and then we specify an operation (like matrix multiplication) by calling a function where we pass in tensors as arguments; the function returns a tensor with the output of that operation. We can then take that output and use it as input for the next operation. Session To begin training and using the network for predictions, we create a new session and initialize all the variables. Once that is done, we can feed it input and get back results. Predict Once we have a session, we can pass it the state of our world, which is an 8-dimensional vector, and it will output the expected reward for each action the agent could take in the form of a 4-dimensional vector. Train To train the network we pass it the input we want to train on along with the desired output. There will be a measurable difference between the output of the network and the desired output; this is referred to as the error or loss. The optimizer trains the network by adjusting the weights and biases to minimize this error.
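To make the build/session/predict/train steps above concrete, here is a minimal sketch in the TensorFlow 1.x style the report describes. The [50, 40] hidden-layer shape is the one the report settles on later; the learning rate and weight initialization are illustrative assumptions, not the project's actual values:

    import numpy as np
    import tensorflow as tf   # TensorFlow 1.x API, which the project used

    STATE_SIZE, NUM_ACTIONS = 8, 4      # Lunar Lander observation and action sizes
    HIDDEN = [50, 40]                   # hidden layer sizes mentioned later in the report

    # Build: placeholders for input states and training targets.
    state_in = tf.placeholder(tf.float32, [None, STATE_SIZE])
    target_q = tf.placeholder(tf.float32, [None, NUM_ACTIONS])

    layer = state_in
    for size in HIDDEN:
        # Fully connected hidden layer with ReLU activation.
        w = tf.Variable(tf.random_normal([int(layer.shape[1]), size], stddev=0.1))
        b = tf.Variable(tf.zeros([size]))
        layer = tf.nn.relu(tf.matmul(layer, w) + b)

    w_out = tf.Variable(tf.random_normal([HIDDEN[-1], NUM_ACTIONS], stddev=0.1))
    b_out = tf.Variable(tf.zeros([NUM_ACTIONS]))
    q_out = tf.matmul(layer, w_out) + b_out         # expected reward for each action

    # Mean squared loss and the Adam optimizer (learning rate is illustrative).
    loss = tf.reduce_mean(tf.square(target_q - q_out))
    train_op = tf.train.AdamOptimizer(learning_rate=1e-4).minimize(loss)

    # Session: initialize variables, then predict and train.
    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    state = np.zeros((1, STATE_SIZE), dtype=np.float32)                  # stand-in observation
    q_values = sess.run(q_out, feed_dict={state_in: state})              # predict
    sess.run(train_op, feed_dict={state_in: state, target_q: q_values})  # train toward a target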
  • 10. PyCharm We used PyCharm for most of the project. It has very good Git integration, a good merge tool, strong syntax highlighting and code completion, and a great interactive debugger. PyCharm also highlights code that doesn't conform to PEP 8, the Python style guide, which was a great way to keep the project's code style consistent. Anaconda There are many free Python libraries, and they often extend or use other Python libraries. This creates a lot of dependencies that need to be managed. TensorFlow requires a specific 64-bit version of Python and relies on NumPy. In our project we also use Matplotlib, which likewise depends on NumPy. We found that Anaconda is a great way to abstract those problems away, allowing us to have multiple versions of Python and Python libraries on our machines. It downloads all the libraries we need and lets us easily switch between custom-tailored Python environments. We originally intended to work on an Atari game, but we had difficulties getting the dependencies set up. There were also some interesting walking simulators, but those required a MuJoCo license. We got the Box2D-based environments working, so we went with Lunar Lander. The Process We started with TensorFlow's getting-started documentation and the MNIST example, which is essentially the "Hello World" of machine learning with TensorFlow. Given images of handwritten digits, the goal is to classify each image by the digit it represents. Our first attempt at making a neural network was based on the network used to solve the MNIST classification problem. Cartpole in Keras Some of our initial work was to solve another OpenAI Gym challenge called Cart-Pole because it was smaller in scope. It was solved using Keras instead of TensorFlow. In the Cart-Pole challenge we used a version of experience replay: the agent played 1000 games taking random actions, and the episodes that happened to win were saved and trained on. This method was tried on Lunar Lander with poor results. In theory, this model-free reinforcement learning algorithm could be used with other environments because we are not providing explicit rules about how the world works; we simply provide states and rewards to the agent. However, it didn't solve the Lunar Lander problem, likely because the number of inputs and possible actions increased, making it a more difficult problem to solve. Code architecture The main functionality of the Agent occurred in three functions: reset, step, and end. The reset function was called at the start of every episode. It would reset the environment and a few episode-specific variables. The step function is where the agent would choose its action and step the environment forward. Depending on the implementation, the agent was either trained here or in the end function. The end function is called at the
  • 11. end of each episode. It would be used to output some debug info and sometimes to train on a batch of previous episodes (a short sketch of this structure appears after these notes). A configuration class was created to store parameters for use with the Agent class. This allowed us to configure the parameters of our Agent in a simple way. We intended to save these configurations to file for possible replaying or test automation, but that was never implemented. To improve our workflow, we had multiple main files within the project (this is trivial to do in Python). This allowed us to test and edit different implementations with minimal conflicts between each other. Due to Python's ability to run any individual file as main, this was very easy. We were also able to check whether changes we made broke the other's version of the project. This way we could make major changes without stepping on each other's code when it was unnecessary. In one of the main files, we made it easy to switch between the CPU and GPU versions of TensorFlow, since the GPU version could be difficult to use in some situations. Customizable layers One of the first custom features we added to our project was the ability to easily change the size and depth of our hidden layers. No other project we found allowed for this customization, but it was valuable for us to try new network shapes without having to modify the source code that we shared. The sizes of the hidden layers were passed in as a parameter, making them easily customizable. Batching There are some operations that we didn't want to happen every single episode. Rendering the environment every episode significantly affects how fast our experiments run, so we don't render every episode. When logging, we also don't write to file every episode. We had a batch_size parameter, and after completing batch_size episodes we would render an entire episode and do any file I/O we needed. This was also a good time to print any debug info we wanted. Optimizers Gradient descent In the very first attempt at a custom network, we tried to use gradient descent, but the output would diverge to infinity so we had to stop using it. We switched optimizers and revisited this one later; we figured out that our learning rate had been too high, but we didn't see any advantage to switching back to gradient descent, so it was ultimately not used in the final version. Adam optimizer The Adam optimizer is similar to the gradient descent optimizer except that it adapts each parameter's step size using running estimates of the gradient's first and second moments, along with a few other tweaks. This optimizer is more computationally expensive, but we chose it because most of the material we were referencing used the Adam optimizer for this and related problems.
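The sketch referenced above: a minimal outline of the reset/step/end structure and configuration object, assuming the classic Gym API of the time. All names and field values here are illustrative, not the project's actual code:

    import gym

    class Config:
        # Illustrative parameters of the kind the configuration class held.
        hidden_sizes = [50, 40]
        batch_size = 25
        discount_rate = 0.99

    class Agent:
        def __init__(self, env, config):
            self.env = env
            self.config = config
            self.episode_reward = 0.0

        def reset(self):
            # Called at the start of every episode: reset the environment
            # and any per-episode bookkeeping.
            self.episode_reward = 0.0
            return self.env.reset()

        def step(self, state):
            # Choose an action and step the environment forward; depending on
            # the implementation, training happened here or in end().
            action = self.choose_action(state)
            next_state, reward, done, info = self.env.step(action)
            self.episode_reward += reward
            return next_state, reward, done

        def end(self):
            # Called at the end of each episode: debug output and, in some
            # versions, training on the saved batch of experience.
            print("episode reward:", self.episode_reward)

        def choose_action(self, state):
            # Placeholder policy; the real agent queries the network's Q-values.
            return self.env.action_space.sample()

    agent = Agent(gym.make("LunarLander-v2"), Config())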
  • 12. Graphing Log Files Visualizing our results with a live graph has been informative. It allows us to cut experiments short if they have a low score and extreme variance, which saves us a lot of time. Keeping records of previous experiments is valuable. We originally overwrote log files so that an experiment could be tweaked and rerun if the previous run turned out to have something wrong with it. This turned out to be a terrible idea and made it easy to lose log files; you could also corrupt a log file by running two experiments at the same time with the same name. We started adding timestamps to the experiment names so that each one was unique, which was much more practical. We often delete the logs of runs that crashed or were terminated early, since there is usually not much valuable information there. Other experiments are simply archived in a subfolder when we are done with them. We found it helpful to increase the number of available line colors and styles so that more runs could be on screen without the information on the graph becoming diluted and unreadable. We also made sure to highlight runs that were in progress (we could tell from the file modification timestamp), to draw those on top of other lines, and to order the legend by the most recent log file creation. This doesn't happen automatically; we had to reverse the ordering of the legend to achieve it. Setting window title We also dove into OpenAI's source code so that we could change the window title of the environment, allowing us to identify which experiment we were looking at. We are considering making it less hacky and finding out how hard it would be to contribute to the open-source project. This basic feature greatly improved our quality of life when running multiple experiments. Selecting actions Greedy One of the important factors in a reinforcement learning system is the policy that determines how the agent takes an action. To choose an action we run the neural net by feeding it the current state and getting the output, an array of 4 values that represents the expected future reward for taking each action. The greedy policy simply means taking the action with the highest expected reward. E-Greedy Another approach is e-greedy. This policy defines an epsilon where a random action is taken with probability epsilon and the best action is taken with probability 1 - epsilon. An epsilon value that is too small will result in the agent converging to a local minimum rather than the global minimum, so a relatively high epsilon value is used and reduced over time. This approach allows the agent to "explore" its choices and discover more optimal solutions.
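A minimal sketch of the e-greedy selection just described (NumPy only; the function name and example values are ours):

    import numpy as np

    def e_greedy_action(q_values, epsilon):
        # With probability epsilon take a random action (explore);
        # otherwise take the action with the highest expected reward (exploit).
        if np.random.rand() < epsilon:
            return np.random.randint(len(q_values))
        return int(np.argmax(q_values))

    # Example: q_values would come from the network's 4-element output.
    action = e_greedy_action(np.array([1.2, -0.3, 0.8, 0.1]), epsilon=0.1)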
  • 13. Softmax The main disadvantage of an e-greedy approach is that even if two actions are nearly equal in value, only the single highest value is considered. The second action, even if it is only slightly less optimal, will only be chosen with a probability of 0.25 * epsilon. This makes it a lot harder for the agent to explore other potentially optimal solutions. An e-greedy policy using argmax does not let the agent recognize that there may be a second action that is nearly as good as the best one; it will either choose a random action or the single most optimal one. By implementing a softmax function, the agent chooses an action based on its weight compared to the others. We also implemented a temperature variable that allows the probabilities softmax returns to be skewed closer together (all values approaching 0.25) or farther apart (a single action nearing 1.0). A high temperature introduces more randomness because it pushes the weights closer together; a low temperature causes the policy to converge to the best action. The probability returned by the softmax function is the exponential of the reward divided by temperature, divided by the sum of the exponentials of all the rewards divided by temperature, shown below (a code sketch of this policy follows these notes): // note: the reward values were clamped between -500 and 500 to prevent overflow Probability[action] = exp(rewards[action] / temp) / sum(exp(rewards[i] / temp)) Softmax returns the probability that each action is the optimal action, and an action is then selected at random using these probabilities. Problems that arise from this policy include values converging to 0 or infinity: as the rewards get too large or too small, the exponential in softmax diverges. The reward values must be clamped to prevent this, which loses some of the information stored in the actual values. For example, when all of the rewards fall below -500, the probability distribution gives every action an equal chance. This happens even if three of the rewards are -1000 and one is -500; clearly the action that leads to -500 is the best one, but because of the clamping that prevents overflow errors, the agent does not end up favoring it. TensorBoard TensorBoard is an application used to visualize TensorFlow graphs. Unfortunately, we didn't get a lot of use out of it. It got us to use name scopes to help organize the graph and see it visually. When we got it working, the graph looked more complicated than we anticipated and didn't serve us well for showing the organization of our network. We also intended to use TensorBoard to see the values of the weights the graph was using, possibly live, but we didn't see the use in continuing that line of investigation.
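The sketch referenced above: the clamped softmax-with-temperature policy from the Softmax section (NumPy only; the function name and example values are ours). A common numerically safer variant subtracts the maximum value before exponentiating, noted in a comment:

    import numpy as np

    def softmax_action(q_values, temperature):
        # Clamp as described above to avoid overflow in exp, at the cost of
        # losing information once values hit the clamp.
        clipped = np.clip(q_values, -500.0, 500.0) / temperature
        # (A numerically safer variant would subtract clipped.max() here.)
        exp_q = np.exp(clipped)
        probabilities = exp_q / np.sum(exp_q)
        # Sample an action at random according to those probabilities.
        return int(np.random.choice(len(q_values), p=probabilities))

    action = softmax_action(np.array([1.2, -0.3, 0.8, 0.1]), temperature=1.0)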
  • 14. Other sources As we discover more resources and alternative solutions, we find ourselves picking features and techniques to integrate into our own project and looking for ways to optimize what we know (replay bank, controlling how much data is passed to TensorFlow, batching our TensorFlow operations, avoiding garbage collection). Our first resource was OK, but it didn't explain its approach very well and the code is not as well written: https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0 This website has some good articles on the theory and implementation of experience replay, but its source code is a bit cryptic in places: https://jaromiru.com/2016/09/27/lets-make-a-dqn-theory/ This is someone's capstone project write-up. It's an easy read for learning about his process, but he uses a different library than we do. His source is cryptic, but he gave us a good idea of what size network to use (hidden layers of 50 and 40): https://github.com/dennisfrancis/LunarLander-v2/blob/master/report.pdf This person's source code was well written, so we made some important observations from reading through it, mainly that you can train on many samples every step. The implementation was flawed, however: it used the alternative one-hot version of calculating loss and had to use a massive network (hidden layers of 256, 256, and 500) to solve the problem. Our solution used a significantly smaller network: https://github.com/Seraphli/YADQN/blob/master/code/openai/LunarLander-v2/Experiment_5/evaluation.py Experience Replay Experience replay is when, instead of training on every new observation you make, you store your observations (or experiences) in a bank and train on them later. There are several advantages to this approach. It is more efficient to train on a batch of observations than to train on each one individually, and when training every step you don't have to train on just one observation, you can train on hundreds. Neural networks, even large ones, don't have an infinite capacity to learn; every time the network optimizes for some state it may do so at the expense of other states. Ideally, it only optimizes away tendencies that had a negative impact on performance. However, if you train on every step of an episode sequentially, the neural network will develop a bias toward its most recent experiences and "forget" the experiences it had farther back. By saving experiences we can sample them at random later, eliminating the bias toward recent experiences. This allows the neural network to more easily converge on a generalized strategy that isn't biased toward, or overfit to, its most recent experiences. A good strategy to use with experience replay is to preload the bank with experiences gathered by taking random actions. This lets you train on more samples from the very beginning. Before we preloaded experiences, the sample size we could train with was capped by the number of experiences accrued, which limited how much learning could be done in the earlier episodes. Train many times, low rate We think one of the biggest advantages experience replay afforded us is that we could now train many times per step. We could learn more from each step we took by saving it to the replay pool. In this way, a single step can be trained on many times, but mixed in with other states so that we do not overfit to those states. Replay Bank as an Array The resources we found typically used a list or a deque for storing experiences. Using a NumPy array for the experience replay bank seems to bring performance improvements, likely because it eliminates a lot of garbage collection by Python.
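A sketch of the array-backed replay bank described above: preallocated NumPy arrays used as a ring buffer, overwriting the oldest experience once full. The field layout, names, and capacity are illustrative, not the project's actual code:

    import numpy as np

    class ReplayBank:
        def __init__(self, capacity, state_size=8):
            self.capacity = capacity
            self.states      = np.zeros((capacity, state_size), dtype=np.float32)
            self.actions     = np.zeros(capacity, dtype=np.int32)
            self.rewards     = np.zeros(capacity, dtype=np.float32)
            self.next_states = np.zeros((capacity, state_size), dtype=np.float32)
            self.dones       = np.zeros(capacity, dtype=np.bool_)
            self.count = 0      # how many slots have been filled so far
            self.index = 0      # next slot to overwrite (ring buffer)

        def add(self, state, action, reward, next_state, done):
            i = self.index
            self.states[i], self.actions[i], self.rewards[i] = state, action, reward
            self.next_states[i], self.dones[i] = next_state, done
            self.index = (self.index + 1) % self.capacity
            self.count = min(self.count + 1, self.capacity)

        def sample(self, batch_size):
            # Uniform random sample, which removes the recency bias discussed above.
            idx = np.random.randint(0, self.count, size=batch_size)
            return (self.states[idx], self.actions[idx], self.rewards[idx],
                    self.next_states[idx], self.dones[idx])

    bank = ReplayBank(capacity=100000)
    # bank.add(state, action, reward, next_state, done) each step, then later:
    # states, actions, rewards, next_states, dones = bank.sample(64)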
  • 15. Adjusting sample size Since we had found that the agent could solve the problem when using experience replay, we decided to implement it and then adjust some of the policies we were using. The main change was using the softmax action-selection policy instead of an e-greedy approach. After making this change, we ran several tests to see how it combined with experience replay at different sample sizes. The first tests used sizes of 1, 10, 100, and 1000. A size of 1 was subpar and did not come close to solving the problem. A size of 10 gave results similar to the previous tests before implementing experience replay. We had to stop the size-1000 test because it would have taken far too long to finish and did not appear to be converging to a solution. All of the tests produced an agent that would slow down near the landing pad, but it would not stop on the pad and thus was never considered "landed." We also saw that taking no action was considered the best action, but the softmax kept the agent from reliably taking it, since it was selected with a probability of only around 40%. Since these tests showed the agent was not converging fast enough with the current softmax parameters, we ran more tests adjusting the temperature. The network was preloaded with a few thousand random actions, so we lowered the temperature to make it more likely to select the best action; it would have a few hundred episodes of training on the random data before it started training on new data. We found that with too low a temperature, the probabilities of taking each action converge on 25%. This is because we had to clamp the values between -500 and 500, so after many losses all the rewards would be clamped to the same value, which was not intended. We had to end these tests while we figured out a solution. Training the Network Training After Episode One strategy was to wait until the end of an episode to do any training. When the agent trained once every step, it did not use any of the other rewards, which makes it very short-sighted and may keep it from finding a long-term optimal strategy. In order to factor in the long-term reward, we tried waiting until the end of the episode to begin training the neural net. During the episode, every action, state, and reward was saved to be processed later. Starting from the last step of the episode, each step's reward is adjusted by adding the reward from the next step multiplied by the discount rate. This is represented in the following code: // the final step's reward is left as-is for (i = rewards.size - 2; i >= 0; --i) rewards[i] = rewards[i] + discount_rate * rewards[i + 1]; Updating the rewards in this manner allowed the agent to see that certain actions may lead to certain end states. Considering that the final reward was usually either -100 or 100, this greatly affected the rewards and pushed them towards a certain result (good or bad). Training at the end of the episode on every step from that episode also proved to be more efficient.
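The same backward pass in Python, as a direct translation of the loop above (the function name and example values are ours):

    import numpy as np

    def discounted_rewards(rewards, discount_rate):
        # Walk backwards so each step's reward also reflects the discounted
        # rewards of everything that followed it; the final reward is unchanged.
        out = np.array(rewards, dtype=np.float32)
        for i in range(len(out) - 2, -1, -1):
            out[i] += discount_rate * out[i + 1]
        return out

    # Example: a 3-step episode that ends in a crash (-100), discount rate 0.99.
    print(discounted_rewards([1.0, 2.0, -100.0], 0.99))   # [-95.03, -97.0, -100.0]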
  • 16. The method of training the network at the time was to train each step. We had an updated method of calculating the reward used for training: the network would get the next state's expected rewards in order to compute the value of moving to that state, using the following formula: new_reward = raw_reward + discount_rate * expected_reward(next_state) The expected reward of the next state depends on which action-selection policy is being used. Using softmax, the expected reward is calculated with the softmax function; using argmax alone, the expected reward is simply the maximum reward of the next state. Since the previous method of training at the end of the episode was performing somewhat decently, we decided to try this new method of calculating reward at the end of the episode as well. This would let us start at the end of the episode and update the rewards backwards, ensuring the new information is used when calculating the expected reward. The agent did not improve and was, in fact, worsened tremendously. We think it ended up training too much on the future rewards and would only take actions that were beneficial in the long run. This turned out to be taking no action, so it would not try correcting itself until it was too late. We then reverted to how it was before, where it had a small amount of knowledge about future actions. Training Every Step In the end we did switch back to training during every step, but instead of training on that step's observation, we took a large sample from the experience replay bank and trained on that. Loss functions Loss is the error between what the network produced and what we want it to produce. The optimizer discovers the best values for the weights and biases to get the results we want. To do this, it finds the derivatives of the loss with respect to the weights and biases and adjusts them to minimize the loss. We used mean squared loss, which means the loss is the squared difference between the values returned by the network and the target values we want it to return. There is also cross-entropy loss, but that works better with sigmoid activation functions; we are using ReLU, a piecewise-linear activation function, so it's better to use a quadratic loss function like mean squared error. http://neuralnetworksanddeeplearning.com/chap3.html To calculate the loss, we get the expected rewards the network produces, then adjust the expected reward for the action that was taken using the Q-Learning update. We feed this new set of expected rewards (where only one element was changed) back in as the target, and it gets subtracted from the original array of expected rewards. The result is an array of zeros except for the element associated with the action that was taken, which contains the difference between the old and new value. This array is then squared and summed (which is just the difference squared). This is the loss, and this value gets fed to the optimizer.
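A sketch of how such training targets can be assembled from a sampled batch, following the formula and loss description above. Here q_network is an assumed callable that returns one row of expected rewards per state; this is not the project's actual code:

    import numpy as np

    def build_training_targets(q_network, states, actions, rewards, next_states,
                               dones, discount_rate):
        targets = np.array(q_network(states))      # start from the current predictions
        next_q = np.array(q_network(next_states))
        for i in range(len(states)):
            target = rewards[i]
            if not dones[i]:
                # Argmax form of expected_reward(next_state).
                target += discount_rate * np.max(next_q[i])
            # Only the entry for the action actually taken changes, so the squared
            # difference against the original prediction is non-zero only there.
            targets[i, actions[i]] = target
        return targets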
  • 17. One-hot Alternative One of our references provided an alternative way to calculate the loss. It produces the same final loss value, but different derivatives for the optimizer to work with. Instead of giving the network the entire expected-rewards array as the target, we pass it a one-hot array representing the action, along with the expected reward for that action. The original expected-rewards array is multiplied by the one-hot array to isolate the expected reward of the action that was taken, and then we take the difference between the sum of that array and the expected reward we passed in. This is then squared and passed to the optimizer (both formulations are sketched in code after these notes). We tried this with the version we had at the time, which wasn't quite solving the challenge yet: it would very quickly get many wins, but not enough to be considered solved, and then fall off into a very sporadic state. After switching loss functions the agent didn't approach a solution to the problem, so although it proved to be more stable, we switched back to the original loss function. The difference We believe the one-hot version of calculating loss may have its applications but does not work for this project. The derivatives produced in this manner will not account for any error introduced in the expected rewards for the other three actions; because those values are multiplied by zeros, any error produced by changing the weights is hidden. The former method uses subtraction to isolate the difference between the original expected rewards and the intended expected rewards, so when changing the weights creates a difference in the expected rewards for the other three actions, this is reflected in the derivatives. The latter method hides that introduced error because the three other expected rewards are multiplied by zero, so any error caused by changing the weights is hidden from the derivatives used by the optimizer. We consulted with Professor Bede and he believes this reasoning is correct. Ending Early One of the things we did to save time was to track the average score of the previous 100 episodes so that we could declare the experiment a success and end early. This shifted our focus from reaching a stable solution to reaching a solution faster. We also began recording timestamps so that we could see how long the runs were taking.
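The sketch referenced above: the two loss formulations side by side in TensorFlow 1.x style. The tensor names are illustrative, and q_out stands in for the network's output (a placeholder is used here so the snippet is self-contained):

    import tensorflow as tf

    q_out = tf.placeholder(tf.float32, [None, 4])       # stand-in for the network output

    # 1) Full-vector target: identical to q_out except for the taken action's entry.
    full_target = tf.placeholder(tf.float32, [None, 4])
    loss_full = tf.reduce_sum(tf.square(full_target - q_out))

    # 2) One-hot alternative: isolate the taken action and compare to a scalar target.
    action_one_hot = tf.placeholder(tf.float32, [None, 4])
    scalar_target = tf.placeholder(tf.float32, [None])
    q_taken = tf.reduce_sum(q_out * action_one_hot, axis=1)
    loss_one_hot = tf.reduce_sum(tf.square(scalar_target - q_taken))

    # Both give the same loss value, but in (2) the multiplication by zeros drops
    # the other three actions' predictions out of the loss entirely, which is the
    # drawback discussed under "The difference" above.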
  • 19. Missed Opportunities Having a well-established vocabulary It took us a while to establish the shared vocabulary we used to describe the algorithms and data. We would pick arbitrary names, but as we learned more we found more standard nomenclature. The worst case of this was probably using the word "batch" to mean different things. Strict adherence to Q-Learning We started with only a rough idea of what Q-Learning looked like in practice, so we ended up with a few versions that did not strictly adhere to the Q-Learning strategy. Test automation We would have liked to set up a way to predefine a series of experiments in the form of config files and have a program run the experiments in a few threads, but the experiments took so long that there wasn't a lot of benefit to doing this. Target networks One of the techniques we wanted to use was target networks. The idea is that you use two networks concurrently: one that makes the decisions, and one that learns from those decisions. Periodically, the network that is being trained gets its weights and biases copied over to the network that is making the decisions (this periodic copy is sketched after these notes). This process slows the rate at which the network learns and is supposed to make the network more stable. TensorFlow and the GPU We aren't sure that we saw gains from using the GPU version of TensorFlow. With more time we might have run some tests to check. The networks we were using may have been small enough that just transferring data to the GPU could have been a bottleneck. We have no way of knowing for now. Jupyter Notebooks Jupyter notebooks present a new paradigm for coding, particularly for code assignments in Python or R. We haven't done much with them, but they seem to be a great way to write and document code, or to present code as assignments. If we had more time, we think they could be worth looking into for the machine learning curriculum. They seem well suited to academic environments. Too slow Running experiments took a lot of time. We couldn't do as much experimenting and tweaking as we wanted because it would take hours to learn whether a small change to one of the parameters had any effect. It would have been nice to have access to more powerful machines. From our research, we found that there are multiple ways to run these experiments in the cloud; AWS has servers with high-end graphics cards for machine learning work. We found this web page: http://cs231n.github.io/gce-tutorial/ It goes into using Google Cloud services and Jupyter notebooks to run experiments on high-end servers.
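The sketch referenced under Target networks above: a TensorFlow 1.x style periodic weight copy from the network being trained to the network making the decisions. The variable scopes are assumed names for the sake of the example, not something our project actually defined:

    import tensorflow as tf

    # Assumes the two networks were built inside tf.variable_scope("trained") and
    # tf.variable_scope("deciding"), with variables created in the same order.
    trained_vars  = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="trained")
    deciding_vars = tf.get_collection(tf.GraphKeys.TRAINABLE_VARIABLES, scope="deciding")
    copy_ops = [d.assign(t) for t, d in zip(trained_vars, deciding_vars)]

    # Then, every N training steps:
    #     sess.run(copy_ops)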
  • 20. Fine tuning parameters There are some parameters we could explore more: Replay Bank size: This affects how many times an experience gets trained on before it is "forgotten" by being overwritten with a new experience. Learning rate and other Optimizer parameters One of the main problems the agent had is that it appears to stop learning after a certain point. We noticed that the optimizer has a couple of other parameters that may affect how it approaches the solution. We wanted to test whether lowering the optimizer's epsilon value would let it get closer to the solution, and we also wanted to run more tests with the learning rate, since these two values are closely linked. It was evident from previous tests that the learning rate played a large role in how the agent learned. We went back to revisit some of these tests while also tuning the epsilon parameter, thinking that changing the lower bound of the learning rate might make the gradient descent give more favorable results. The results from these tests were inconclusive and did not stand out from the rest of the data; more tests would be needed to draw a conclusion on whether the optimizer's epsilon value has a large impact on the agent's ability to converge to a solution. Discount rate The discount rate allows the agent to consider expected future rewards; it could be interesting to see whether decreasing this value would yield better or worse results. Network size This is one of the biggest opportunities to decrease the time it takes for the agent to converge on a solution. We started with no hidden layers, then [8, 8] layers, but we ended with [50, 40] based on one of the resources that solved this problem. Since we are doing things a little differently, it would be interesting to see what kind of results we could get by changing the size of the network. We might also change the learning rate when changing the network. Better explore policy management One of the factors we spent the most time changing was the explore factor. We thought that if we could get the explore factor right we could find a solution. This didn't turn out to be true; once we implemented experience replay with preloading of random experiences, it mostly removed the need for exploration. We could have made that process less painful by having a better way to switch between explore policies. The way we structured our code, every switch between methods required a small refactor. Configs If our experiments didn't take so long to run, we could have managed our experiment configurations better. This probably would have included saving and loading configs from JSON, and storing these configs inside the Agent class so that we didn't have needless copying of values from the config object to the Agent object every time we created an Agent.
  • 21. Reward or gradient clipping, Huber loss functions There are other ways to calculate the reward and the loss function that we didn't get to explore. Preload with human trials It would be interesting, and kind of fun, to have a human play as well and incorporate those experiences into the replay bank. Conclusions Experience Replay was Key When we first started, we thought experience replay was just another tool to make our networks more consistent and stable, but it turned out to be a very important factor in the agent finally being able to complete the challenge. Preloading experiences and e-greedy In the e-greedy policy we used a diminishing epsilon to make the agent explore in the early episodes, giving it a wide enough variety of experiences to find the best solution. When we implemented experience replay and began preloading the bank with observations made from purely random actions, this seemed to replace the need for e-greedy: all of the early observations came from random actions, and as time went on the bank was overwritten with experiences from taking the optimal choices. This gave the agent a similar overall training regimen. Neural nets were not very stable Just because a neural network has converged on a good solution doesn't mean it will stop changing; a neural network may get worse over time. We think this could be caused by the network training too much on successful episodes: it may start to optimize runs that begin in more ideal starting states and "forget" how to handle the more difficult ones. The network could slowly repurpose the neurons that make it adaptable into neurons that optimize ideal runs, until it reaches a critical threshold where its decreased ability to recover from bad states leads to it encountering more bad states in the future. Needed a large rolling data set of experiences When we first started saving experiences, it consisted of saving them until the end of an episode and training on them then. We were surprised by the amount of training required: the network must train for hours, on millions of samples, in order to learn to play the game well. What we did well The way we built our neural network made it easy to change the number of hidden layers, and the size of those layers, without having to modify the source code. Our implementation is closest to this one: https://github.com/Seraphli/YADQN/blob/master/code/openai/LunarLander-v2/Experiment_5/evaluation.py
  • 22. But we calculated the loss differently, so we could use a much smaller network for our solution. We also used a much simpler e-greedy policy: we chose a small constant epsilon instead of a piecewise function that decays epsilon over time, since preloading on random experiences seems to have an equivalent effect.