Understanding how Deep Q-learning works.

Vivek Padia • Mar 30, 2020

Deep Q-learning is a reinforcement learning algorithm for training agents that learn a task from their own past experience rather than from labelled examples.

DQN (Deep Q-Network) can be thought of as a stepping stone into reinforcement learning, as it is easy to understand and solves many problems of this class. It is a model-free algorithm, meaning it learns from interaction without a model of the environment's dynamics, and it performs best when the environment it runs in resembles the one it was trained in.

It works with three main things: State (S), Action (A), and Reward (R). In most other ML problems the target is fixed, whereas in RL the target keeps changing with every iteration.

Markov Decision Process (MDP)

To solve any deep RL problem, we assume that the next state depends only on the current state and the action taken, not on the earlier history. This is called the Markov property. Decision processes that satisfy this property are called Markov Decision Processes. The problem should be formulated as an MDP before it is passed to the DQN model.
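To make the Markov property concrete, here is a minimal sketch of a toy MDP in Python. The corridor layout, the reward values, and the names `TRANSITIONS` and `step` are all made up for illustration, and the transitions are deterministic only to keep the example short. The point is that `step` needs nothing but the current state and action.

```python
# Hypothetical 3-cell corridor: states 0, 1, 2; state 2 is the terminal goal.
# TRANSITIONS[state][action] -> (next_state, reward)
TRANSITIONS = {
    0: {"left": (0, 0.0), "right": (1, 0.0)},
    1: {"left": (0, 0.0), "right": (2, 1.0)},
}
TERMINAL = {2}

def step(state, action):
    """Return (next_state, reward, done) using only the current state and action."""
    next_state, reward = TRANSITIONS[state][action]
    return next_state, reward, next_state in TERMINAL

# Two different histories that both end with ("right" taken in state 1).
path_a = [step(0, "right"), step(1, "right")]
path_b = [step(0, "left"), step(0, "right"), step(1, "right")]

# The last transition is identical, whatever happened before it.
same_outcome = path_a[-1] == path_b[-1]
```

Because the outcome of a step ignores the history, the agent only ever has to look at the current state when deciding what to do.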

Q Learning

Suppose we knew the expected reward for every action in every state of the game. This would give the agent a cheat sheet for every in-game situation. This collection of values, one per state-action pair, is called the Q-table. Using the Q-table, our algorithm tries to maximize the total reward collected by the end of an episode.

The expected total discounted reward for taking action A in state S and acting optimally afterwards is called the Q-value. The equation for the Q-value is as follows:

Q(S, A) = r(S, A) + γ max_A’ Q(S’, A’)

The above equation states that the Q-value of taking action A in state S is the immediate reward r(S, A) plus the discounted highest Q-value obtainable from the next state S’. Gamma (γ), a number between 0 and 1, is the discount factor that controls how much future rewards contribute. A value of gamma close to 0 makes the agent focus on immediate rewards, while a value close to 1 makes it weigh long-term rewards almost as heavily as immediate ones.

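Q(S, A) = r(S, A) + γ r(S’, A’) + γ² r(S’’, A’’) + … + γⁿ r(Sⁿ, Aⁿ)

Here A’, A’’, … are the best actions in each successive state, so the Q-value is the discounted sum of the rewards collected along the best path.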
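The effect of the discount factor can be checked with a few lines of Python. This is only a sketch: the helper name `discounted_return` and the reward sequence are made up for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * reward_t over a sequence of rewards."""
    total = 0.0
    for t, reward in enumerate(rewards):
        total += (gamma ** t) * reward
    return total

# Made-up episode: the only reward arrives on the third step.
rewards = [0.0, 0.0, 1.0]

far_sighted = discounted_return(rewards, gamma=0.9)    # about 0.81
short_sighted = discounted_return(rewards, gamma=0.5)  # about 0.25
```

The same delayed reward is worth much less to the agent with the smaller gamma, which is why a small gamma leads to short-sighted behaviour.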

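Since the Q-value equation is recursive, we can start from arbitrary guesses for all Q-values. With experience, the estimates converge to the optimal Q-values, and acting greedily on them gives the optimal policy. Each new experience updates the estimate as follows: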

Q(Sₜ, Aₜ) ← Q(Sₜ, Aₜ) + α [Rₜ₊₁ + γ maxₐ Q(Sₜ₊₁, a) - Q(Sₜ, Aₜ)]

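Here α is the learning rate, which sets how far each update moves the old estimate toward the new one. All these Q-values are stored in memory as the Q-table, which the agent consults when choosing its next actions. By the end of training, the table holds the Q-values the agent needs.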
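The update rule can be tried out on a tiny problem. The sketch below is not taken from any library: the 5-cell corridor, the hyperparameter values, and the epsilon-greedy exploration scheme are all assumptions chosen to keep the example small.

```python
import random

random.seed(0)

N_STATES = 5                 # hypothetical corridor: states 0..4, state 4 is the goal
ACTIONS = [0, 1]             # 0 = move left, 1 = move right
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.1   # illustrative values, not tuned

def step(state, action):
    """Move along the corridor; reward 1.0 only when the goal is reached."""
    next_state = state + 1 if action == 1 else max(0, state - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

# Arbitrary starting guess: every Q-value is 0.
q_table = [[0.0, 0.0] for _ in range(N_STATES)]

def choose_action(state):
    """Epsilon-greedy: mostly follow the table, sometimes explore."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    best = max(q_table[state])
    # Break ties at random so the untrained agent does not get stuck.
    return random.choice([a for a in ACTIONS if q_table[state][a] == best])

for episode in range(500):
    state, done = 0, False
    while not done:
        action = choose_action(state)
        next_state, reward, done = step(state, action)
        # A terminal state has no future value.
        best_next = 0.0 if done else max(q_table[next_state])
        td_error = reward + GAMMA * best_next - q_table[state][action]
        q_table[state][action] += ALPHA * td_error
        state = next_state

# Best action in every non-terminal state according to the learned table.
greedy_policy = [q_table[s].index(max(q_table[s])) for s in range(N_STATES - 1)]
```

After training, `greedy_policy` picks "move right" in every cell, and the Q-values shrink by roughly a factor of gamma for each extra step away from the goal.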

Deep Q Learning

Adding a deep neural network to Q-learning gives the Deep Q-learning model. Instead of storing a table, we use the neural network to approximate the Q-value function: the state is given as input, and the Q-values of all possible actions are produced as output.
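As a rough picture of "state in, one Q-value per action out", here is a minimal forward pass written in plain Python. The layer sizes, the ReLU activation, the missing bias terms, and the sample observation are all assumptions for illustration. A real DQN would use a deep learning framework and a trained network.

```python
import random

random.seed(0)

N_STATE_FEATURES, N_HIDDEN, N_ACTIONS = 4, 8, 2   # illustrative sizes

def random_matrix(rows, cols):
    """Small random weights, standing in for an untrained network."""
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

w_hidden = random_matrix(N_HIDDEN, N_STATE_FEATURES)
w_output = random_matrix(N_ACTIONS, N_HIDDEN)

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def q_values(state):
    """Forward pass: state vector in, one Q-value per action out."""
    hidden = [max(0.0, h) for h in matvec(w_hidden, state)]  # ReLU activation
    return matvec(w_output, hidden)

state = [0.1, -0.2, 0.05, 0.3]   # made-up observation
q = q_values(state)

# The greedy action is the index of the largest predicted Q-value.
greedy_action = max(range(N_ACTIONS), key=lambda a: q[a])
```

A single forward pass scores every action at once, which is why the network outputs a vector instead of taking the action as an extra input.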

Steps for learning using Deep Q Networks (DQNs):

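Initialize the network with random weights, which plays the role of a random Q-table.
Compute the Q-values of all possible actions for the current state using the DNN.
Choose an action, observe the reward and the next state, and store the experience in memory.
Update the network so that its Q-value for the chosen action moves toward the target r + γ max Q(S', A').
Repeat these steps until the Q-values stop changing.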

Pseudocode for Deep Q Learning

start with Q₀(S, A) for all S, A
get initial state S
for k = 1, 2, ... until convergence
       sample action A, get next state S'
       if S' is terminal:
              target = R(S, A, S')
              sample new initial state S'
       else:
              target = R(S, A, S') + γ max_A' Qₖ(S', A')
       θₖ₊₁ ← θₖ - α ∇θ Eₛ'~P(S'|S, A)[(Qθ(S, A) - target(S'))²]|θ=θₖ
       S ← S'

Unlike in general supervised machine learning, the target here is not fixed: it is computed from the model's own Q-value estimates, so it keeps changing as the weights are updated.
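The pseudocode can be turned into a small runnable sketch. To keep it short and dependency-free, a linear model over one-hot state features stands in for the deep network here, which is a deliberate simplification and not a real DQN. The corridor environment, the hyperparameters, and all function names are likewise assumptions made for illustration.

```python
import random

random.seed(1)

N_STATES, N_ACTIONS = 5, 2      # hypothetical corridor; state 4 is terminal
ALPHA, GAMMA, EPSILON = 0.05, 0.9, 0.2   # illustrative values, not tuned

def features(state):
    """One-hot encoding of the state (an assumption that keeps the model tiny)."""
    return [1.0 if i == state else 0.0 for i in range(N_STATES)]

def step(state, action):
    """Action 1 moves right, action 0 moves left; reward 1.0 on reaching the goal."""
    next_state = state + 1 if action == 1 else max(0, state - 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done

# theta[a] is the weight vector for action a: Q_theta(S, a) = theta[a] . features(S)
theta = [[0.0] * N_STATES for _ in range(N_ACTIONS)]

def q_value(state, action):
    return sum(w * x for w, x in zip(theta[action], features(state)))

def choose_action(state):
    """Epsilon-greedy with random tie-breaking."""
    if random.random() < EPSILON:
        return random.randrange(N_ACTIONS)
    values = [q_value(state, a) for a in range(N_ACTIONS)]
    best = max(values)
    return random.choice([a for a, v in enumerate(values) if v == best])

state = 0
for k in range(5000):
    action = choose_action(state)
    next_state, reward, done = step(state, action)
    if done:
        target = reward
    else:
        target = reward + GAMMA * max(q_value(next_state, a) for a in range(N_ACTIONS))
    # Gradient step on (Q_theta(S, A) - target)^2, treating the target as a constant.
    error = q_value(state, action) - target
    phi = features(state)
    for i in range(N_STATES):
        theta[action][i] -= ALPHA * 2.0 * error * phi[i]
    # After a terminal transition, restart from the initial state.
    state = 0 if done else next_state

learned_policy = [
    max(range(N_ACTIONS), key=lambda a: q_value(s, a)) for s in range(N_STATES - 1)
]
```

Note that `target` is recomputed from the current weights on every step, so it moves as `theta` changes. With one-hot features the linear model is exactly as expressive as a Q-table, so this sketch shows the shape of the update loop and not the generalization a deep network would add.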

Coding example

We’ll use OpenAI Gym to show a DQN example. Please check this article on Understanding how OpenAI gym works for an introduction.

Vivek Padia

I work with Aubergine Solution as a Machine Learning engineer. We believe in having a problem-solving attitude. I have worked with several different technologies related to ML and integrating them with cloud-based services.