Deep reinforcement learning has got to be one of the coolest tools I have used as an engineer. Once everything is set up, all you have to do is tell the agent what you want accomplished, and watch as the AI figures out incredibly creative ways to accomplish the task.

Understanding neural networks (explained here) is a prerequisite for deep RL. Deep reinforcement learning is also easier to understand if you are comfortable with tabulated reinforcement learning, which I explain in depth here.

## Difference between Tabulated RL and Deep RL

Here is a quick summary of tabulated RL:

- The agent receives the state from the environment
- Using the ε-greedy strategy (or some other state exploration strategy), the agent either takes a random action or picks the action that will lead to the state with the highest value
- At the end of the episode, the agent stores all the states that were visited in a table. Then the agent looks back at the states visited in that episode and updates the value of each state based on rewards received
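The ε-greedy choice in the second step can be sketched as follows. This is only a minimal illustration: `epsilon_greedy` and `next_state_values` are hypothetical names, and the list is assumed to hold the current value estimate of the state each action leads to.

```python
import random

def epsilon_greedy(next_state_values, epsilon):
    """Pick an action index from the estimated value of the state each action leads to."""
    if random.random() < epsilon:
        # Explore: take any action uniformly at random.
        return random.randrange(len(next_state_values))
    # Exploit: take the action whose resulting state has the highest value.
    return max(range(len(next_state_values)), key=lambda a: next_state_values[a])

# With epsilon = 0 the agent never explores, so it always picks the best-valued action.
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)
```

A larger `epsilon` means more random exploration; a common choice (assumed here, not stated above) is to start high and shrink it as the value estimates improve.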

So, if we were given the true values of every state right from the get-go, this would be easy. We would just make the agent always move to the state with the highest value. The main issue is finding these correct state values.

What deep RL does is this: instead of keeping a table of the values of each state we encounter, we approximate the values of all the next possible states in real time using a neural network.

Let’s assume we already have a neural network that takes our current state as input and outputs an array filled with the values of all the next states we can be in (if there are n possible actions, then typically there are n possible next states). Say our environment is the Atari game Breakout, and our agent is in the beginning state (the first frame of the game). In Breakout we can do 3 things: stay still, move left, or move right. Each of these actions will bring about a unique new state. We would input our current state into the NN and receive an array telling us the value of the state that will come from taking each action. We then take the action that leads to the highest value.
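As a rough sketch of this selection step, assume a hypothetical `fake_network` function standing in for a trained NN. The values it returns are made up for illustration, and the ordering of `ACTIONS` is an assumption as well.

```python
# The three Breakout actions from the text, in the order the outputs are assumed to follow.
ACTIONS = ["stay", "left", "right"]

def fake_network(state):
    """Hypothetical stand-in for a trained NN: returns one value per action.

    A real network would compute these from `state`; the numbers here are made up.
    """
    return [0.2, 0.7, 0.1]

def choose_action(network, state):
    """Return the action that leads to the next state with the highest estimated value."""
    next_state_values = network(state)
    best_index = max(range(len(next_state_values)), key=lambda i: next_state_values[i])
    return ACTIONS[best_index]

first_frame = None  # placeholder for the first frame of the game
print(choose_action(fake_network, first_frame))  # prints "left"
```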

The diagram below shows a NN in an environment where there are only 2 available actions. The input is the current state, and the output is the value of each possible next state.

## A quick note on Agent Memory

Before we learn how to train our NN, it’s important to understand how we are storing the information received from playing the game. Most of the time, we are in a state S, we receive reward R, we take action A, and we move to the next state S’. I say most of the time because if we are in a terminal state that ends the episode, we just receive a reward R (since no action is taken and there is no next state).

In practice, what we do is combine these values into a 4-tuple, (S, R, A, S’), and store each tuple in a list.

Now we have this information ready for when we want to train our agent.
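A minimal sketch of such a memory in Python is below. The names `Transition`, `memory`, and `remember` are hypothetical, the states are placeholder strings, and storing `None` for the missing action and next state at a terminal state is an assumed convention.

```python
from collections import namedtuple

# Field order follows the text: state S, reward R, action A, next state S'.
Transition = namedtuple("Transition", ["state", "reward", "action", "next_state"])

memory = []  # the agent's memory: a plain list of 4-tuples

def remember(state, reward, action=None, next_state=None):
    """Append one step to memory.

    For a terminal state there is no action or next state, so both stay None
    (an assumed convention for this sketch).
    """
    memory.append(Transition(state, reward, action, next_state))

# Made-up states and rewards, for illustration only.
remember("s0", 0.0, "left", "s1")
remember("s1", 0.0, "right", "s2")
remember("s2", 1.0)  # terminal state: only a reward is recorded
```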

## Training the Neural Network

So, we know we want our output to be the values of the states that result from taking each action. In order to find all the weights in our NN that will make this possible, we need to identify the loss function we want to minimize. For starters, we want the difference between the output of our NN and the observed next state values to be 0.

If we use the mean squared error loss function, we obtain:
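In words, the mean squared error is the average of the squared differences between each NN output and the corresponding observed next state value. Here is a minimal sketch in plain Python, where `mse_loss`, `predicted_values`, and `observed_values` are names assumed for illustration and the numbers are made up:

```python
def mse_loss(predicted_values, observed_values):
    """Average squared difference between NN outputs and observed next state values."""
    if len(predicted_values) != len(observed_values):
        raise ValueError("both lists must have one entry per action")
    squared_errors = [(p - o) ** 2 for p, o in zip(predicted_values, observed_values)]
    return sum(squared_errors) / len(squared_errors)

# The loss is 0 only when the NN output matches the observed values exactly.
perfect = mse_loss([1.0, 2.0], [1.0, 2.0])
imperfect = mse_loss([1.0, 3.0], [0.0, 1.0])
```

Squaring keeps every error positive, so minimizing this loss pushes each output toward its observed value.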

Now recall from tabulated RL that,