Tuan Anh Le

Deep Q Network, Policy Gradients, Advantage Actor Critic

13 January 2018

This note draws from:

Markov Decision Process (MDP)

An MDP with states $s_t \in \mathcal S$ and actions $a_t \in \mathcal A$ is specified by an initial state distribution $p(s_0)$, a transition distribution $p(s_t \given s_{t - 1}, a_{t - 1})$, and a reward function $R: \mathcal S \times \mathcal A \to \mathbb R$; actions are chosen according to a policy $\pi(a_t \given s_t)$.

Denote the reward at time $t$, $R(s_t, a_t)$, by $r_t$, where $R$ is the reward function.

The MDP trace density is \begin{align} p(s_{0:\infty}, a_{0:\infty}) &= p(s_0) \pi(a_0 \given s_0) \prod_{t = 1}^\infty \left( p(s_t \given s_{t - 1}, a_{t - 1}) \pi(a_t \given s_t) \right). \end{align}
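As a concrete illustration, sampling a (finite-horizon) trace from this density just alternates sampling from the policy and from the transition distribution. Below is a minimal sketch in Python; `env_reset`, `env_step`, and `policy` are hypothetical stand-ins for $p(s_0)$, for $p(s_t \given s_{t - 1}, a_{t - 1})$ together with $R$, and for $\pi(a_t \given s_t)$.

```python
def sample_trace(env_reset, env_step, policy, T):
    """Sample (s_0, a_0, r_0, ..., s_T, a_T, r_T) from the MDP trace density,
    truncated at horizon T.

    env_reset() -> s_0                 (sample from p(s_0))
    env_step(s, a) -> (s_next, r)      (sample from p(s' | s, a); r = R(s, a))
    policy(s) -> a                     (sample from pi(a | s))
    """
    states, actions, rewards = [], [], []
    s = env_reset()
    for t in range(T + 1):
        a = policy(s)
        s_next, r = env_step(s, a)
        states.append(s)
        actions.append(a)
        rewards.append(r)
        s = s_next
    return states, actions, rewards
```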

Deep Q Network (DQN)

There is a proof of convergence of the $Q$-learning algorithm for discrete state and action spaces here. No such guarantee carries over to the deep reinforcement learning setting.

The $Q$ function is defined as \begin{align} Q(s_t, a_t) &= \E[r_t + \gamma r_{t + 1} + \cdots \given S_t = s_t, A_t = a_t] \\ &= \int (r_t + \gamma r_{t + 1} + \cdots) p(s_{t + 1} \given s_t, a_t) \prod_{\tau = t + 1}^\infty \left( \pi(a_\tau \given s_\tau) \cdot p(s_{\tau + 1} \given s_\tau, a_\tau) \right) \,\mathrm ds_{t + 1:\infty} \,\mathrm da_{t + 1:\infty} \\ &= r_t + \gamma \int (r_{t + 1} + \gamma r_{t + 2} + \cdots) p(s_{t + 1} \given s_t, a_t) \prod_{\tau = t + 1}^\infty \left( \pi(a_\tau \given s_\tau) \cdot p(s_{\tau + 1} \given s_\tau, a_\tau) \right) \,\mathrm ds_{t + 1:\infty} \,\mathrm da_{t + 1:\infty}. \label{eq:dqn-q} \end{align}

Now, note that for time $t + 1$, the $Q$ function is \begin{align} Q(s_{t + 1}, a_{t + 1}) &= \int (r_{t + 1} + \gamma r_{t + 2} + \cdots) p(s_{t + 2} \given s_{t + 1}, a_{t + 1}) \prod_{\tau = t + 2}^\infty \left( \pi(a_\tau \given s_\tau) \cdot p(s_{\tau + 1} \given s_\tau, a_\tau) \right) \,\mathrm ds_{t + 2:\infty} \,\mathrm da_{t + 2:\infty}. \end{align}

We use this in \eqref{eq:dqn-q} to obtain: \begin{align} Q(s_t, a_t) &= r_t + \gamma \int Q(s_{t + 1}, a_{t + 1}) p(s_{t + 1} \given s_t, a_t) \pi(a_{t + 1} \given s_{t + 1}) \,\mathrm ds_{t + 1} \,\mathrm da_{t + 1}. \label{eq:dqn-q-2} \end{align}

At the optimal policy, \begin{align} \pi^\ast(a \given s) = \delta_{\argmax_{a'} Q(s, a')}(a), \end{align} where $\delta_x(\cdot)$ denotes a delta mass at $x$. Hence, at the optimal policy, we can write \eqref{eq:dqn-q-2} as follows: \begin{align} Q(s_t, a_t) &= r_t + \gamma \int \max_a Q(s_{t + 1}, a) p(s_{t + 1} \given s_t, a_t) \,\mathrm ds_{t + 1} \\ &= r_t + \gamma \E_{p(s_{t + 1} \given s_t, a_t)}[\max_a Q(s_{t + 1}, a)]. \end{align}

Hence, for random variables $(S_t, A_t, S_{t + 1})$ sampled from the MDP trace density, we would like the value of \begin{align} L_Q(S_t, A_t, R(S_t, A_t), S_{t + 1}) &= \left(Q(S_t, A_t) - (R(S_t, A_t) + \gamma \cdot \max_a Q(S_{t + 1}, a))\right)^2 \end{align} to be as small as possible.

In DQN, the $Q$-network $Q_\phi$ is parameterized by $\phi$. Given a sample $(s_t, a_t, r_t, s_{t + 1})$ from the marginal trace distribution $p(s_t, a_t, s_{t + 1})$, we can perform a stochastic gradient descent step to minimize $L_Q$ using the gradient $\nabla_\phi L_Q(s_t, a_t, r_t, s_{t + 1})$.

Algorithm
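A minimal sketch of the core update in PyTorch (not the full algorithm, which typically adds an experience replay buffer, a separate target network, and an $\epsilon$-greedy exploration policy); `q_net` is a hypothetical network mapping a batch of states to a vector of $Q$-values per state:

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, states, actions, rewards, next_states, gamma):
    """Batched estimate of L_Q.

    states:      [batch, state_dim] float tensor
    actions:     [batch] long tensor
    rewards:     [batch] float tensor
    next_states: [batch, state_dim] float tensor
    """
    # Q_phi(s_t, a_t) for the actions actually taken
    q = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target r_t + gamma * max_a Q_phi(s_{t+1}, a), detached so that the
    # gradient only flows through Q_phi(s_t, a_t)
    with torch.no_grad():
        target = rewards + gamma * q_net(next_states).max(dim=1).values

    return F.mse_loss(q, target)
```

A gradient step is then just `dqn_loss(...).backward()` followed by an optimizer step on $\phi$.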

Experiments

Performance of DQN on CartPole:

Python script for generating this figure.

Policy Gradients (PG)

In policy gradients, we parameterize the policy $\pi_\theta$ by parameters $\theta$. The policy is typically a neural network that maps a state $s_t$ to the parameters of a distribution over actions $a_t$.

We maximize the expected discounted reward with respect to $\theta$ directly using stochastic gradient ascent. The gradient can be rewritten as \begin{align} \nabla_\theta \E\left[ \sum_{t = 0}^T \gamma^t r_t \right] &= \nabla_\theta \int \left(\sum_{t = 0}^T \gamma^t r_t\right) p(s_{0:T}, a_{0:T}) \,\mathrm ds_{0:T} \,\mathrm da_{0:T} \\ &= \int \left(\sum_{t = 0}^T \gamma^t r_t\right) \left( p(s_0) \prod_{t = 1}^T p(s_t \given s_{t - 1}, a_{t - 1}) \right) \left( \nabla_\theta \prod_{t = 0}^T \pi_\theta(a_t \given s_t) \right) \,\mathrm ds_{0:T} \,\mathrm da_{0:T} \\ &= \int \left(\sum_{t = 0}^T \gamma^t r_t\right) \left( \nabla_\theta \sum_{t = 0}^T \log \pi_\theta(a_t \given s_t) \right) p(s_{0:T}, a_{0:T}) \,\mathrm ds_{0:T} \,\mathrm da_{0:T} \\ &= \E\left[ \left(\sum_{t = 0}^T \gamma^t r_t\right) \left( \nabla_\theta \sum_{t = 0}^T \log \pi_\theta(a_t \given s_t) \right) \right]. \label{eq:pg} \end{align}

Hence, we can get an unbiased estimate of the gradient by sampling $(s_{0:T}, a_{0:T}) \sim p(s_{0:T}, a_{0:T})$ and evaluating $\left(\sum_{t = 0}^T \gamma^t r_t\right) \nabla_\theta \sum_{t = 0}^T \log \pi_\theta(a_t \given s_t)$.

Algorithm
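A minimal sketch of the corresponding surrogate loss in PyTorch, for a single sampled episode and a hypothetical `policy_net` mapping states to action logits; minimizing it with SGD corresponds to ascending the expected discounted reward:

```python
import torch

def pg_surrogate_loss(policy_net, states, actions, rewards, gamma):
    """Negative policy-gradient surrogate for one episode.

    states:  [T + 1, state_dim] float tensor
    actions: [T + 1] long tensor
    rewards: list of T + 1 floats
    """
    # sum_t gamma^t r_t (a constant with respect to theta)
    discounted_return = sum(gamma ** t * r for t, r in enumerate(rewards))

    # sum_t log pi_theta(a_t | s_t)
    log_probs = torch.distributions.Categorical(
        logits=policy_net(states)).log_prob(actions)

    # gradient is -(sum_t gamma^t r_t) * grad_theta sum_t log pi_theta(a_t | s_t)
    return -discounted_return * log_probs.sum()
```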

Experiments

Performance of Policy Gradients on CartPole:

Python script for generating this figure.

Advantage Actor Critic (A2C)

We are going to use the following fact: \begin{align} \E[\nabla_\theta \log \pi_\theta (a_t \given s_t)] &= \int \nabla_\theta \log \pi_\theta(a_t \given s_t) \pi_\theta(a_t \given s_t) \,\mathrm da_t \\ &= \int \nabla_\theta \pi_\theta(a_t \given s_t) \,\mathrm da_t \\ &= \nabla_\theta \int \pi_\theta(a_t \given s_t) \,\mathrm da_t \\ &= \nabla_\theta 1 \\ &= 0. \label{eq:a2c-zero-grad} \end{align}
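This is easy to sanity-check numerically for a discrete policy, where the expectation is the finite sum $\sum_{a_t} \pi_\theta(a_t \given s_t) \nabla_\theta \log \pi_\theta(a_t \given s_t)$. A small sketch with a softmax policy over three actions (the logits $\theta$ here are purely illustrative):

```python
import torch

# theta parameterizes a softmax policy over 3 actions at a fixed state
theta = torch.randn(3, requires_grad=True)
probs = torch.softmax(theta, dim=0)
log_probs = torch.log_softmax(theta, dim=0)

# sum_a pi(a) * grad_theta log pi(a)
expected_score = torch.zeros(3)
for a in range(3):
    (grad,) = torch.autograd.grad(log_probs[a], theta, retain_graph=True)
    expected_score += probs[a].detach() * grad

print(expected_score)  # ~ [0., 0., 0.] up to floating-point error
```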

Starting from \eqref{eq:pg}, we have: \begin{align} \nabla_\theta \E\left[ \sum_{t = 0}^T \gamma^t r_t \right] &= \E\left[ \left(\sum_{t = 0}^T \gamma^t r_t\right) \left( \nabla_\theta \sum_{t = 0}^T \log \pi_\theta(a_t \given s_t) \right) \right] \\ &= \sum_{t = 0}^T \E\left[ \left(\sum_{\tau = 0}^T \gamma^\tau r_\tau\right) \nabla_\theta \log \pi_\theta(a_t \given s_t) \right] \\ &= \sum_{t = 0}^T \left( \E\left[ \left(\sum_{\tau = 0}^{t - 1} \gamma^\tau r_\tau\right) \nabla_\theta \log \pi_\theta(a_t \given s_t) \right] + \E\left[ \left(\sum_{\tau = t}^T \gamma^\tau r_\tau\right) \nabla_\theta \log \pi_\theta(a_t \given s_t) \right]\right) \\ &= \sum_{t = 0}^T \left( \E_{p(s_{0:t - 1}, a_{0:t - 1}) p(s_t \given s_{t - 1}, a_{t - 1})}\left[\left(\sum_{\tau = 0}^{t - 1} \gamma^\tau r_\tau\right) \E_{\pi_\theta(a_t \given s_t)}\left[\nabla_\theta \log \pi_\theta(a_t \given s_t)\right]\right] + \E\left[ \left(\sum_{\tau = t}^T \gamma^\tau r_\tau\right) \nabla_\theta \log \pi_\theta(a_t \given s_t) \right]\right) \\ &= \sum_{t = 0}^T \E\left[ \left(\sum_{\tau = t}^T \gamma^\tau r_\tau\right) \nabla_\theta \log \pi_\theta(a_t \given s_t) \right] && \text{(use \eqref{eq:a2c-zero-grad})} \\ &= \sum_{t = 0}^T \E\left[ \left(\left(\sum_{\tau = t}^T \gamma^\tau r_\tau\right) - b(s_t)\right) \nabla_\theta \log \pi_\theta(a_t \given s_t) \right] && \text{(use \eqref{eq:a2c-zero-grad})}, \label{eq:a2c} \end{align} where $b(s_t)$ is called a baseline. The more $b(s_t)$ is correlated with $\sum_{\tau = t}^T \gamma^\tau r_\tau$, the lower the variance of the gradient estimator.

Approximation of the Value Function

In A2C, $b(s_t)$ is usually set to an approximation of the value function $V$, which is defined as \begin{align} V(s_t) &= \E[r_t + \gamma r_{t + 1} + \cdots \given S_t = s_t]. \end{align}

Let $V$ be a function parameterized by $\eta$. Then, using the $n$-step relationship for the value function (eq. 14 in here), it can be optimized by minimizing the following loss with respect to $\eta$: \begin{align} L_V(s_t, r_t, \dotsc, s_{t + n - 1}, r_{t + n - 1}, s_{t + n}) = \left( \left(\sum_{k = 0}^{n - 1} \gamma^k r_{t + k}\right) + \gamma^n V^-(s_{t + n}) - V(s_t) \right)^2, \label{eq:loss-v} \end{align} where the minus superscript denotes detaching from the computation graph. This is optimized in parallel with the optimization of the policy parameters $\theta$.
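A minimal sketch of this loss in PyTorch, for one $n$-step segment and a hypothetical `value_net` mapping a state to a scalar value:

```python
import torch

def value_loss(value_net, states, rewards, gamma, n):
    """n-step value-function loss L_V.

    states:  [n + 1, state_dim] float tensor, (s_t, ..., s_{t+n})
    rewards: [n] float tensor, (r_t, ..., r_{t+n-1})
    """
    # n-step target: sum_k gamma^k r_{t+k} + gamma^n V^-(s_{t+n})
    n_step_return = sum(gamma ** k * rewards[k] for k in range(n))
    target = n_step_return + gamma ** n * value_net(states[n]).detach()

    # only V(s_t) contributes to the gradient with respect to eta
    return (target - value_net(states[0])) ** 2
```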

$n$-step A2C

Starting from \eqref{eq:a2c}, we derive the $n$-step A2C update as described in Algorithm S3 of the original A3C paper:

\begin{align} &\sum_{t = 0}^T \E\left[ \left(\left(\sum_{\tau = t}^T \gamma^\tau r_\tau\right) - V(s_t)\right) \nabla_\theta \log \pi_\theta(a_t \given s_t) \right] \\ &\approx \sum_{\color{red}{i = 0}}^{\color{red}{n - 1}} \E\left[ \left(\left(\sum_{\tau = \color{red}{i}}^T \gamma^\tau r_\tau \right) - V(s_{\color{red}{i}})\right) \nabla_\theta \log \pi_\theta(a_{\color{red}{i}} \given s_{\color{red}{i}}) \right] && \text{(sum to } (n - 1) \text{ instead of } T \text{)} \\ &\approx \sum_{i = 0}^{n - 1} \E\left[\left(\left(\sum_{\tau = \color{red}{t + i}}^T \gamma^\tau r_\tau\right) - V(s_{\color{red}{t + i}})\right) \nabla_\theta \log \pi_\theta(a_{\color{red}{t + i}} \given s_{\color{red}{t + i}})\right] && \text{(sum from } t \text{ instead of } 0 \text{)} \\ &\approx \sum_{i = 0}^{n - 1} \E\left[\left(\left(\sum_{\tau = t + i}^T \gamma^{\tau \color{red}{- (t + i)}} r_\tau\right) - V(s_{t + i})\right) \nabla_\theta \log \pi_\theta(a_{t + i} \given s_{t + i})\right] && \text{(start discounting from } 0 \text{ instead of } (t + i) \text{)}\\ &\approx \sum_{i = 0}^{n - 1} \E\left[\left(\left(\sum_{\tau = t + i}^{\color{red}{t + n - 1}} \gamma^{\tau - (t + i)} r_\tau\right) \color{red}{+ \gamma^{n - i} V(s_{t + n})} - V(s_{t + i})\right) \nabla_\theta \log \pi_\theta(a_{t + i} \given s_{t + i})\right] && \text{(approximate tail of the inner sum by the value function)} \\ &\approx \sum_{i = 0}^{n - 1} \E\left[\left(\left(\sum_{\color{red}{j = 0}}^{\color{red}{n - i - 1}} \gamma^{\color{red}{j}} r_{\color{red}{t + i + j}}\right) + \gamma^{n - i} V(s_{t + n}) - V(s_{t + i})\right) \nabla_\theta \log \pi_\theta(a_{t + i} \given s_{t + i})\right] && \text{(change inner index from } \tau \text{ to } j \text{)}. \label{eq:n-step-a2c} \end{align}

Noting the similar form of \eqref{eq:loss-v} and \eqref{eq:n-step-a2c}, we define the so-called advantage function, parameterized by $\eta$: \begin{align} A(s_{t + i}, r_{t + i}, \dotsc, s_{t + n - 1}, r_{t + n - 1}, s_{t + n}) &= \left(\sum_{j = 0}^{n - i - 1} \gamma^{j} r_{t + i + j}\right) + \gamma^{n - i} V^-(s_{t + n}) - V(s_{t + i}). \end{align} Hence, given an $n$-step segment $(s_t, a_t, r_t, \dotsc, s_{t + n})$, the policy and value-function losses are: \begin{align} L_\pi(s_t, r_t, \dotsc, s_{t + n - 1}, r_{t + n - 1}, s_{t + n}) &= -\sum_{i = 0}^{n - 1} A(s_{t + i}, r_{t + i}, \dotsc, s_{t + n - 1}, r_{t + n - 1}, s_{t + n})^- \log \pi_\theta(a_{t + i} \given s_{t + i}) \\ L_V(s_t, r_t, \dotsc, s_{t + n - 1}, r_{t + n - 1}, s_{t + n}) &= \sum_{i = 0}^{n - 1} A(s_{t + i}, r_{t + i}, \dotsc, s_{t + n - 1}, r_{t + n - 1}, s_{t + n})^2. \end{align}
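Putting the two losses together, a minimal sketch in PyTorch for one $n$-step segment, with hypothetical networks `policy_net` (state to action logits) and `value_net` (state to scalar value):

```python
import torch

def a2c_losses(policy_net, value_net, states, actions, rewards, gamma, n):
    """Policy and value-function losses for one n-step segment.

    states:  [n + 1, state_dim] float tensor, (s_t, ..., s_{t+n})
    actions: [n] long tensor,  (a_t, ..., a_{t+n-1})
    rewards: [n] float tensor, (r_t, ..., r_{t+n-1})
    """
    values = value_net(states).squeeze(-1)  # V(s_t), ..., V(s_{t+n})
    log_probs = torch.distributions.Categorical(
        logits=policy_net(states[:-1])).log_prob(actions)

    policy_loss, value_loss = 0.0, 0.0
    for i in range(n):
        # A = sum_j gamma^j r_{t+i+j} + gamma^{n-i} V^-(s_{t+n}) - V(s_{t+i})
        n_step_return = sum(gamma ** j * rewards[i + j] for j in range(n - i))
        advantage = n_step_return + gamma ** (n - i) * values[n].detach() - values[i]

        policy_loss = policy_loss - advantage.detach() * log_probs[i]  # A^- is detached
        value_loss = value_loss + advantage ** 2

    return policy_loss, value_loss
```

The two losses are then minimized jointly, typically as a weighted sum (often with an added entropy bonus on the policy), with respect to $\theta$ and $\eta$.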

Implementations:

Algorithm

Experiments

Performance of A2C on CartPole (doesn’t work yet; TODO: make it work):

Python script for generating this figure.
