This article focuses on how I used a Neural Network and concepts from Supervised Learning in a Reinforcement Learning task. The project uses a simple NN trained with a heuristic in place of known outputs, since we do not know the exact expected output here. The model is vision based: a game screenshot is the only input given to the model. The complete code for the project can be found here.

The algorithm playing the game after learning from only 15 games. This was the first time it had seen a bird, and it still guessed that it needed to jump for low-flying birds but not for high-flying ones!

Motivation and Goal

The idea of having a program learn to play a game all by itself seemed really interesting to me, and I wanted to try it out myself. One way to achieve this is to find a good model from a space of possible models using a search method such as a genetic algorithm. But this doesn’t scale well to non-trivial games or higher dimensional models, as the program will not be learning but simply trying out every possible model in the hope of finding a good one. So I wanted to work on an algorithm that actually tries to learn, and hence I decided to take on this project.

Reinforcement Learning is a broad field with a lot of ongoing development and research. As a learning experience, I wanted to start from scratch and implement something on my own instead of directly using a known algorithm, similar to how I implemented my own CNN based on YOLO regression concepts in my real-time face detection with attribute classification project. So I decided to experiment with the concepts I learned from Supervised Learning and try to apply them here.

Code-Game Interaction

The game is opened in an actual browser, and the GUI automation framework pyautogui is used to take screenshots as well as to control the mouse and keyboard to play the game.

The program uses image-based template matching with OpenCV to detect the game’s location on the screen. A similar method is used to figure out when the game is over.


Neural Network

As mentioned above, the goal of the project is to apply my supervised learning experience to reinforcement learning. So the NN used is a simple fully connected feed-forward network with a single hidden layer, trained with the Adam optimizer in exactly the same way as any supervised learning task. The only difference here is the expected outputs.

Unlike supervised learning tasks, where we had labeled data and the expected output was known, here the model needs to figure out the correct output by itself. To help it achieve that, we guide it in the right direction: instead of a correct expected output, we use a heuristic to decide what outputs should be used to train the model.

Activation function

The model uses tanh activation for the hidden layer and softmax for the output layer (since the outputs are probabilities, as explained below). There is no specific reason for tanh; it just seemed right to me and happened to work well.
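A minimal sketch of such a network in NumPy, assuming illustrative layer sizes (the real input size depends on the screenshot resolution used):

```python
# One fully connected hidden layer with tanh, softmax over the 3 moves.
# Layer sizes here are illustrative, not the ones used in the project.
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_hidden, n_moves = 50 * 150, 64, 3  # flattened screenshot -> 3 moves
W1 = rng.normal(0, 0.01, (n_inputs, n_hidden))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.01, (n_hidden, n_moves))
b2 = np.zeros(n_moves)

def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def forward(x):
    h = np.tanh(x @ W1 + b1)       # hidden layer
    return softmax(h @ W2 + b2)    # probabilities of jump / duck / do nothing

probs = forward(rng.random(n_inputs))
print(probs)  # three non-negative values summing to 1
```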

Model Inputs and Outputs

The model takes a game screenshot as its only input and outputs the probabilities of 3 possible moves: jump, duck, and do nothing (keep running). We continuously take screenshots of the game, feed them to the model, and read its output. Whichever move has the maximum probability is the move the model chose to take at that particular moment.
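Turning the output probabilities into a move is a simple argmax; the probability vector below is made up for illustration:

```python
# Picking the move with maximum probability at each frame.
import numpy as np

MOVES = ["jump", "duck", "do nothing"]
probs = np.array([0.2, 0.7, 0.1])       # e.g. one output of the network
move = MOVES[int(np.argmax(probs))]
print(move)  # the chosen move for this frame
```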

Data collection for training

During each game, we store a list of all inputs given to the network and the moves the model makes (up to the last 3000 input-output pairs). The model is trained on this data after every 2 games.
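A rolling buffer capped at the last 3000 pairs can be sketched with a `deque`; the frame placeholders below stand in for real screenshots:

```python
# Rolling buffer of (screenshot, move) pairs, capped at the last 3000.
from collections import deque

HISTORY = 3000
history = deque(maxlen=HISTORY)  # oldest entries are discarded automatically

def record(frame, move):
    history.append((frame, move))

# Simulate recording 3500 frames; only the last 3000 are kept.
for i in range(3500):
    record(f"frame{i}", "jump")
print(len(history))
```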


Heuristics for Expected Outputs

This is the key part of the algorithm: we need to figure out how to train our model without any known expected outputs. If we think carefully about it, we can safely assume that the only wrong moves the model made are among the last few (due to the linear nature of the game itself). Since the end of the game doesn’t depend on earlier moves, and the model survived until a few seconds before the end, we can say that all moves up to the last few were at least somewhat valid.

We use this to our advantage. Since we agree with the earlier moves the model made, we can safely use them as training data to reinforce the model to keep making them. So we take the moves the model made and reinforce them by using the same moves as expected outputs, but with high probability.

To account for uncertainty, I did not use a probability of 1 for training and used 0.9 instead; the remaining 0.1 was divided between the other 2 moves (0.05 each). This is because there is a small chance the model got this far by luck rather than correct moves. For example, what if the model always jumps irrespective of the input, and that happened to avoid all obstacles? Hence, we do not want to reinforce these moves too strongly.

Now we know the correct moves and are reinforcing them, but that alone is not enough to train the model: discouraging bad moves matters more, since the model was already almost right for the rest of the moves anyway. The problem is that we don’t know what the correct moves should have been near the end. But we do know the wrong moves, so we use what we have: we discourage a wrong move by training with a very low probability for it and dividing the rest of the probability among the other moves (since we don’t know which of them is correct). As above, I used 0.1 instead of 0, since we are not sure that every move in the last few frames is wrong and don’t want to discourage them too strongly. The remaining 0.9 is divided between the other moves (0.45 each).

This is the main logic, but in the actual code I split the last few outputs (the ones we assume are wrong) into 2 parts: the small set of moves just before dying, which are most likely to have contributed to the death, and the rest of the moves we think are wrong. I use higher weights (in the loss function) for the last moves to discourage them more strongly, and lower weights for the rest of the wrong moves.
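The heuristic labels can be sketched as follows. The probability values (0.9/0.05 for reinforced moves, 0.1/0.45 for discouraged ones) are from the text; the split points and loss weights are illustrative assumptions, not the project’s actual values:

```python
# Build heuristic training targets from the moves taken in one game.
import numpy as np

N_MOVES = 3
BAD_TAIL = 30   # frames before death assumed wrong (illustrative)
LAST_FEW = 10   # the very last moves, discouraged more strongly (illustrative)

def make_targets(moves):
    """moves: list of move indices taken during one game, in order."""
    n = len(moves)
    targets = np.zeros((n, N_MOVES))
    weights = np.ones(n)
    for i, m in enumerate(moves):
        if i < n - BAD_TAIL:
            # Surviving moves: reinforce with high probability.
            targets[i, :] = 0.05
            targets[i, m] = 0.9
        else:
            # Assumed-wrong moves: discourage, split the rest evenly.
            targets[i, :] = 0.45
            targets[i, m] = 0.1
            # Higher loss weight for the very last moves (hypothetical values).
            weights[i] = 2.0 if i >= n - LAST_FEW else 0.5
    return targets, weights

targets, weights = make_targets([0] * 100)
print(targets[0], targets[-1])
```

Each target row is a valid probability distribution, so it can be used directly against the softmax output of the network.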

Training and loss function

Now we have the input data, the moves made by the model, and the correct move probabilities (from the heuristic), and we train the model using backpropagation with the Adam optimizer. For the loss function, I use mean squared error. Since the output is probabilities, I could have used log loss, but I intentionally avoided it because it imposes a very high penalty on incorrect probabilities. That would be a disadvantage here, since some of the outputs from our heuristic are not very specific and may even be wrong. With mean squared error, even if the heuristic is wrong for some inputs, we are not imposing a very high penalty.
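A quick comparison on a single made-up example shows why log loss would punish a wrong heuristic label much harder than MSE:

```python
# Penalty comparison when the heuristic label disagrees with a confident model.
import numpy as np

target = np.array([0.9, 0.05, 0.05])   # heuristic label (possibly wrong)
pred   = np.array([0.05, 0.9, 0.05])   # model confidently prefers another move

mse = np.mean((pred - target) ** 2)
log_loss = -np.sum(target * np.log(pred + 1e-12))

print(round(mse, 3), round(log_loss, 3))  # log loss is several times larger
```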


Results

We pass an image to the algorithm, which doesn’t even know what the pixels mean, and still it figures out the game rules by itself and is able to play the game (better than me).

Below is an image of snippets from a game the algorithm played. We can see that the algorithm figured out that it needs to jump over cacti and low-flying birds. It even figured out that it needs to duck to avoid high-flying birds. Although, for some reason, it thinks it is a good idea to always duck while running.

Snippets from the game played by our NN. It figured out that it needs to jump over cacti and birds, and even learned that ducking is enough to avoid high-flying birds.


When the algorithm saw a high-flying bird for the first time, I anticipated it would make a mistake and jump, since I expected it to have learned a simple rule: jump whenever something is in front. Interestingly, it didn’t jump, as if it knew how to avoid the bird (it likes to duck by default, so it was already ducking).

The algorithm also keeps up with the increasing speed, until the game is fast enough that whenever it jumps to avoid one obstacle it lands on another cactus or bird. This is because the model doesn’t actually know about the speed at all! All it gets is a screenshot of the game, so it makes decisions based on a single image rather than the game’s dynamics. See the next section for my ideas to fix this.

Conclusion and Next Steps

This was an interesting project and a good learning exercise for me. The method used is quite simple and makes some assumptions. The biggest drawback of this approach is that we need to know (at least approximately) which moves are the wrong ones, as in this project, where the last few frames are most likely responsible for the game over.

Overall, this project shows that with a simple understanding of the game and some heuristics, we can train a NN to play simple games using normal backpropagation, just as we do for any other supervised learning task.

Beyond finding bad moves, most games pose another challenge: in more complicated games we usually need to explore the environment to find a solution, and hence we also need a method that teaches the algorithm to explore unknown moves or areas in a systematic and balanced way. Unlike this project, where there are only 3 moves and 1 of them is always right, most games are not that straightforward.

For this project, there are a few things we could change or try out to improve the algorithm or code, some of them being:

  • Incorporate speed. Currently, the model makes its decision based solely on a screenshot and hence has no way to know that the game speeds up over time. This kills the model when it jumps to avoid an obstacle but lands on another. We need to give speed cues to the model; one simple way would be to pass the last 2 frames as input, or the difference of the last 2 frames. That way the model has sufficient data and a chance to learn about speed.
  • Keep a copy of the best model, as the model sometimes degrades when it starts trying a new strategy. This way we can restart learning from this checkpoint.
  • Use data from more games instead of just 2 before training the existing model, especially after the model has learned to play to some extent. This would also avoid the degradation mentioned in the previous point, as the model would have more data to consider before coming up with a new strategy. An ideal approach would be to initially train the model after every game, then gradually increase the number of games between training runs.
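The first idea, giving the model speed cues via frame differencing, could be sketched like this; the frame sizes and obstacle position are made up for illustration:

```python
# Feed the model the current frame plus the difference of the last two
# frames, so motion (and hence speed) becomes visible in the input.
import numpy as np

prev_frame = np.zeros((50, 150), dtype=np.int16)
curr_frame = np.zeros((50, 150), dtype=np.int16)
curr_frame[20:30, 40:50] = 255  # a 10x10 obstacle that just appeared

diff = curr_frame - prev_frame  # nonzero only where something moved
model_input = np.concatenate([curr_frame.ravel(), diff.ravel()])
print(model_input.shape)  # twice the size of a single flattened frame
```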

The complete code for the project can be found in my GitHub here.