Tuan Anh Le

Policy Improvement Theorem

18 February 2025

The policy improvement theorem is a key building block of the policy iteration algorithm. This note is a perhaps excessive elaboration on standard presentations of the theorem, aiming to isolate the key inequality behind it.

Given the notation for Markov decision processes and the state-action value function, a.k.a. Q-function \begin{align} Q^\pi(s_t, a_t) := \E_{p(s_{t + 1:\infty}, a_{t + 1:\infty} \given s_t, a_t)}\left[\sum_{\tau = t}^\infty \gamma^{\tau - t} R(s_\tau, a_\tau) \right] \label{eq:Q} \end{align} from this note, let’s define the state value function as \begin{align} V^\pi(s_t) := \E_{p(s_{t + 1:\infty}, a_{t:\infty} \given s_t)}\left[\sum_{\tau = t}^\infty \gamma^{\tau - t} R(s_\tau, a_\tau) \right] \label{eq:V}. \end{align} The state value function represents the expected total reward if we start at state \(s_t\) and follow the policy \(\pi\). The only difference from \eqref{eq:Q} is that the expectation in \eqref{eq:V} is additionally over the current action \(a_t \sim \pi(a_t \given s_t)\). Hence we can write \begin{align} V^\pi(s_t) = \E_{\pi(a_t \given s_t)}\left[Q^\pi(s_t, a_t)\right]. \label{eq:VQ} \end{align} That is, the state value function is the state-action value function, averaged over the policy.
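
As a concrete (if tangential) illustration of \eqref{eq:Q} and \eqref{eq:VQ}, here is a minimal Python sketch on a hypothetical random tabular MDP; the arrays `P`, `R`, `pi` and the helper `evaluate_q` are made up for illustration and are not part of the note's setup. It approximates \(Q^\pi\) by iterating the one-step recursion implied by \eqref{eq:Q}, then checks that averaging it over the policy gives \(V^\pi\) as in \eqref{eq:VQ}.

```python
import numpy as np

# Hypothetical toy MDP (illustration only): S states, A actions, discount gamma.
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)  # p(s_{t+1} | s_t, a_t)
R = rng.random((S, A))                                          # R(s_t, a_t)
pi = rng.random((S, A)); pi /= pi.sum(axis=-1, keepdims=True)   # pi(a_t | s_t)

def evaluate_q(policy, iters=2000):
    """Fixed-point iteration Q(s, a) <- R(s, a) + gamma * E_{p, policy}[Q(s', a')]."""
    Q = np.zeros((S, A))
    for _ in range(iters):
        V = (policy * Q).sum(axis=-1)  # V(s): average Q over the policy
        Q = R + gamma * P @ V          # one-step lookahead under the transition model
    return Q

Q = evaluate_q(pi)
V = (pi * Q).sum(axis=-1)              # state value function, averaging Q over pi
print(V)
```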

The policy improvement theorem states that \begin{align} V^{\pi'}(s_t) \geq V^\pi(s_t) \quad \forall s_t, \label{eq:theorem} \end{align} where \begin{align} \pi'(a_t \given s_t) := \delta_{\argmax_a Q^\pi(s_t, a)}(a_t) \end{align} is a greedy policy that deterministically picks the action that maximizes the Q-function of \(\pi\). That is, the expected total reward under the greedy policy is greater than or equal to the expected total reward under the original policy, regardless of the starting state.
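
For a tabular Q-function, the greedy policy is just a delta (one-hot) distribution placed on the maximizing action in each state. A small sketch, with a made-up \(Q^\pi\) table:

```python
import numpy as np

# Hypothetical Q^pi(s, a) table (illustration only): rows are states, columns are actions.
Q = np.array([[1.0, 2.0, 0.5],
              [0.2, 0.1, 0.3]])
greedy_actions = Q.argmax(axis=-1)               # argmax_a Q^pi(s, a) for each state
pi_greedy = np.eye(Q.shape[-1])[greedy_actions]  # delta mass as a one-hot row per state
print(pi_greedy)                                 # [[0. 1. 0.], [0. 0. 1.]]
```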

The key inequality behind this result is that for any distribution \(p(x)\) and function \(f(x)\), \(\E_{p(x)}[f(x)] \leq \E_{p'(x)}[f(x)]\), where \(p'(x) := \delta_{x^*}(x)\) and \(x^* := \argmax_x f(x)\). Defining \(f^* := \max_x f(x)\), we show this as: \begin{align} \E_{p(x)}[f(x)] &= \int p(x) f(x) \,\mathrm dx \label{eq:6}\\
&\leq \int p(x) f^* \,\mathrm dx &\text{(\eqref{eq:7}'s integrand is always } \geq \text{ \eqref{eq:6}'s)} \label{eq:7}\\
&= f^* \\
&= \int \delta_{x^*}(x) f(x) \,\mathrm dx \\
&= \int p'(x) f(x) \,\mathrm dx \\
&= \E_{p'(x)}[f(x)]. \end{align}
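
A quick numerical sanity check of this inequality, with an arbitrary made-up discrete \(p(x)\) and \(f(x)\):

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.random(10)                               # arbitrary f(x) on a finite set of x values
p = rng.random(10); p /= p.sum()                 # arbitrary distribution p(x)
p_star = np.zeros(10); p_star[f.argmax()] = 1.0  # p'(x): delta mass at x* = argmax_x f(x)

assert (p * f).sum() <= f.max()                  # E_p[f] <= f*
assert np.isclose((p_star * f).sum(), f.max())   # E_{p'}[f] = f*
```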

If we take \(x\) to be the action \(a\), \(p(x)\) to be the policy \(\pi(a \given s)\) (a distribution over \(a\) for a fixed \(s\)) and \(f(x)\) to be the state-action value function \(Q^\pi(s, a)\) (a function of \(a\) for a fixed \(s\)), the above inequality becomes \begin{align} \E_{\pi(a \given s)}[Q^\pi(s, a)] \leq \E_{\pi'(a \given s)}[Q^\pi(s, a)] \label{eq:ineq}. \end{align}

In other words, \(\max_a Q^\pi(s, a)\) is the best we can do if all we can change is the distribution over which the expectation is taken, and a distribution that achieves this maximum is a delta mass on the maximizer of \(Q^\pi(s, \cdot)\).

Noting that the LHS of \eqref{eq:ineq} is equal to \(V^\pi(s)\) due to \eqref{eq:VQ}, we can interpret \eqref{eq:ineq} as saying: the value of state \(s\) is less than or equal to the value of taking the first action according to the greedy policy \(\pi'(a \given s)\) and subsequent actions according to \(\pi(a \given s)\).

Continuing this line of reasoning, taking the first \(k\) actions according to the greedy policy and all subsequent actions according to the original policy should be at least as good, for any \(k\); taking \(k\) to \(\infty\), we recover \eqref{eq:theorem}. More formally, using the shorthands \begin{align} V_\tau &:= V^\pi(s_\tau), & V_\tau' &:= V^{\pi'}(s_\tau), & Q_\tau &:= Q^\pi(s_\tau, a_\tau), & R_\tau &:= R(s_\tau, a_\tau), \\ \pi_\tau &:= \pi(a_\tau \given s_\tau), & \pi_\tau' &:= \pi'(a_\tau \given s_\tau), & p_{\tau + 1} &:= p(s_{\tau + 1} \given s_\tau, a_\tau) \end{align} for any time step \(\tau \geq t\),
we can derive \eqref{eq:theorem} as \begin{align} V^\pi(s_t) &= V_t \\
&= \E_{\pi_t}[Q_t] & \text{(from \eqref{eq:VQ})} \\
&\leq \E_{\pi_t'}[Q_t] &\text{(from \eqref{eq:ineq})} \\
&= \E_{\pi_t'}\left[R_t + \gamma \E_{p_{t + 1} \pi_{t + 1}}[Q_{t + 1}]\right] &\text{(from \eqref{eq:Q})} \label{eq:16}\\
&\leq \E_{\pi_t'}\left[R_t + \gamma \E_{p_{t + 1} \pi_{t + 1}'}[Q_{t + 1}]\right] &\text{(apply \eqref{eq:ineq} again)} \\
&= \E_{\pi_t' p_{t + 1} \pi_{t + 1}'}\left[R_t + \gamma Q_{t + 1} \right] \label{eq:18}\\
&\,\,\,\vdots &\text{(repeat steps \eqref{eq:16}–\eqref{eq:18} on } Q_{t + 1}, Q_{t + 2}, \dotsc \text{)}\\
&\leq \E_{\pi_t' \prod_{i = 1}^k p_{t + i} \pi_{t + i}'}\left[\left(\sum_{j = 0}^{k - 1} \gamma^j R_{t + j}\right) + \gamma^k Q_{t + k}\right] \\
&\leq \E_{\pi_t' \prod_{i = 1}^\infty p_{t + i} \pi_{t + i}'}\left[\sum_{j = 0}^\infty \gamma^j R_{t + j}\right] & \text{(taking } k \text{ to } \infty\text{; since } \gamma^k Q_{t + k} \to 0 \text{ for bounded rewards)} \\
&= V_t' &\text{(from \eqref{eq:V})}\\
&= V^{\pi'}(s_t). \end{align}
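
To close the loop, here is a self-contained sketch (on the same kind of hypothetical random tabular MDP as above, not from the note) that evaluates \(V^\pi\), forms the greedy policy \(\pi'\), evaluates \(V^{\pi'}\), and checks \eqref{eq:theorem} state by state:

```python
import numpy as np

# Hypothetical toy MDP (illustration only).
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)  # p(s_{t+1} | s_t, a_t)
R = rng.random((S, A))                                          # R(s_t, a_t)
pi = rng.random((S, A)); pi /= pi.sum(axis=-1, keepdims=True)   # original policy pi(a_t | s_t)

def evaluate_q(policy, iters=2000):
    """Fixed-point iteration for the Q-function of `policy` in this tabular MDP."""
    Q = np.zeros((S, A))
    for _ in range(iters):
        Q = R + gamma * P @ (policy * Q).sum(axis=-1)
    return Q

Q_pi = evaluate_q(pi)
V_pi = (pi * Q_pi).sum(axis=-1)                # V^pi(s)

pi_greedy = np.eye(A)[Q_pi.argmax(axis=-1)]    # greedy (delta) policy w.r.t. Q^pi
V_greedy = (pi_greedy * evaluate_q(pi_greedy)).sum(axis=-1)  # V^{pi'}(s)

# Policy improvement theorem: V^{pi'}(s) >= V^pi(s) for every state s.
assert np.all(V_greedy >= V_pi - 1e-9)
print(V_pi, V_greedy)
```

Iterating this improvement step (re-evaluate, then re-greedify) is exactly the policy iteration algorithm mentioned at the start of this note.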
