18 February 2025 (updated 21 March 2025)
The policy improvement theorem is a key building block of the policy iteration algorithm, which itself is important for understanding actor-critic algorithms. There are many presentations of this theorem. This note tries to highlight the intuitions behind the key steps of the proof.
Given the notation for Markov decision processes and the action value function, a.k.a. the Q-function \begin{align} Q^\pi(s_t, a_t) := \E_{p(s_{t + 1:\infty}, a_{t + 1:\infty} \given s_t, a_t)}\left[\sum_{\tau = t}^\infty \gamma^{\tau - t} R(s_\tau, a_\tau) \right] \label{eq:Q} \end{align} from this note, let’s define the state value function as \begin{align} V^\pi(s_t) := \E_{p(s_{t + 1:\infty}, a_{t:\infty} \given s_t)}\left[\sum_{\tau = t}^\infty \gamma^{\tau - t} R(s_\tau, a_\tau) \right] \label{eq:V}. \end{align} The state value function represents the expected total reward if we start at state \(s_t\) and follow the policy \(\pi\). The only difference from \eqref{eq:Q} is that the expectation in \eqref{eq:V} is additionally over the current action \(a_t \sim \pi(a_t \given s_t)\). Hence we can write \begin{align} V^\pi(s) = \E_{\pi(a \given s)}\left[Q^\pi(s, a)\right]. \label{eq:VQ} \end{align} That is, the state value function is the state-action value function, averaged over the policy.
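To make the definitions concrete, here is a minimal Python sketch, assuming a hypothetical 2-state, 2-action tabular MDP (the arrays `p_env`, `R`, and `pi` below are made up purely for illustration, not taken from this note). It iterates the Bellman recursion for \(Q^\pi\) under a fixed policy and reads off \(V^\pi\) via \eqref{eq:VQ}.

```python
import numpy as np

# Hypothetical tabular MDP: p_env[s, a, s'] is the transition probability,
# R[s, a] the expected reward, pi[s, a] = pi(a | s).
gamma = 0.9
p_env = np.array([[[0.8, 0.2], [0.1, 0.9]],
                  [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
pi = np.array([[0.6, 0.4],
               [0.5, 0.5]])

# Policy evaluation: iterate Q <- R + gamma * E_{p_env}[E_pi[Q]] to convergence.
Q = np.zeros_like(R)
for _ in range(1000):
    V = (pi * Q).sum(axis=1)        # V^pi(s) = E_{pi(a|s)}[Q^pi(s, a)], eq. (VQ)
    Q = R + gamma * p_env @ V       # expected discounted value over next states

V = (pi * Q).sum(axis=1)
print("Q^pi:\n", Q, "\nV^pi:", V)
assert np.allclose(Q, R + gamma * p_env @ V)   # fixed point reached
```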
The policy improvement theorem states that \begin{align} V^{\pi^\star}(s) \geq V^\pi(s) \quad \forall s, \label{eq:theorem} \end{align} where \begin{align} \pi^\star(a \given s) = \argmax_{\pi'(a \given s)} \E_{\pi'(a \given s)}\left[Q^\pi(s, a)\right]. \label{eq:argmax} \end{align} That is, the expected total reward under \(\pi^\star\) is greater than or equal to the expected total reward under the original policy, regardless of the starting state.
There can be many solutions that satisfy the argmax in \eqref{eq:argmax}. A common choice is a greedy policy that deterministically picks the best action based on \(Q^\pi\): \begin{align} \pi_g^\star(a \given s) := \delta_{a^\star(s)}(a), \text{ where } a^\star(s) := \argmax_a Q^\pi(s, a). \end{align} Here is a pictorial proof for why the greedy policy satisfies the argmax:
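As a concrete illustration of the greedy choice, here is a small sketch assuming \(Q^\pi\) is available as a table (the numbers are made up): \(\pi_g^\star\) puts all its mass on \(a^\star(s) = \argmax_a Q^\pi(s, a)\) and therefore attains \(\max_a Q^\pi(s, a)\) in every state.

```python
import numpy as np

Q = np.array([[1.5, 0.7],
              [0.3, 2.1]])                       # hypothetical Q^pi(s, a) table
a_star = Q.argmax(axis=1)                        # a_star(s) = argmax_a Q^pi(s, a)
pi_g_star = np.zeros_like(Q)
pi_g_star[np.arange(Q.shape[0]), a_star] = 1.0   # delta on the greedy action

# The greedy policy attains the upper bound max_a Q^pi(s, a) in every state.
assert np.allclose((pi_g_star * Q).sum(axis=1), Q.max(axis=1))
```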
We know that \(\max_a Q^\pi(s, a)\) is an upper bound on the objective \(\E_{\pi'(a \given s)}[Q^\pi(s, a)]\), since the objective is an average over values that are each \(\leq \max_a Q^\pi(s, a)\). The greedy policy attains this upper bound, so it is a maximizer. Therefore, as an aside, for any solution \(\pi^\star\) to the argmax, we have \begin{align} \E_{\pi^\star(a \given s)}\left[Q^\pi(s, a)\right] = \max_a Q^\pi(s, a). \end{align}
As another aside, the picture also illustrates other potential choices of \(\pi^\star\). If \(Q^\pi\) has multiple modes with the same max value, \(\pi^\star\) can be any mixture over the actions corresponding to those modes.
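A tiny numerical version of this aside, with a made-up \(Q^\pi(s, \cdot)\) row that has two tied maxima: any mixture over the tied actions attains \(\max_a Q^\pi(s, a)\) and hence also solves the argmax.

```python
import numpy as np

q_row = np.array([2.0, 0.5, 2.0])        # hypothetical Q^pi(s, .) with a tie
pi_star_row = np.array([0.3, 0.0, 0.7])  # arbitrary mixture over the tied actions
assert np.isclose(pi_star_row @ q_row, q_row.max())
```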
By definition, \(\E_{\pi^\star(a \given s)}\left[Q^\pi(s, a)\right] \geq \E_{\pi'(a \given s)}\left[Q^\pi(s, a)\right]\) for any \(\pi'\). In particular, when \(\pi'\) is \(\pi\), we have \begin{align} \E_{\pi^\star(a \given s)}\left[Q^\pi(s, a)\right] \geq \E_{\pi(a \given s)}\left[Q^\pi(s, a)\right]. \label{eq:ineq} \end{align} Noting that the RHS is equal to \(V^\pi(s)\) due to \eqref{eq:VQ}, we can interpret the inequality as saying: the value of \(\pi\) is less than or equal to the value of taking the first action according to \(\pi^\star\) and subsequent actions according to \(\pi\).
Continuing this line of reasoning, taking the first \(k\) actions according to \(\pi^\star\) and subsequent actions according to the original policy should be at least as good. Taking \(k \to \infty\), we recover \eqref{eq:theorem}.
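Here is a sketch of this intuition on a made-up 2-state, 2-action MDP (all numbers chosen only for illustration): evaluate \(\pi\), build the greedy policy from \(Q^\pi\), then compute the value of following the greedy policy for the first \(k\) steps and \(\pi\) afterwards. The value is non-decreasing in \(k\) and approaches \(V^{\pi^\star}\).

```python
import numpy as np

gamma = 0.9
p_env = np.array([[[0.8, 0.2], [0.1, 0.9]],
                  [[0.5, 0.5], [0.3, 0.7]]])    # hypothetical p_env[s, a, s']
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                      # hypothetical R[s, a]
pi = np.array([[0.6, 0.4],
               [0.5, 0.5]])                     # original policy pi(a | s)

def evaluate(policy, n_iter=2000):
    """Policy evaluation: iterate the Bellman equation for Q under `policy`."""
    Q = np.zeros_like(R)
    for _ in range(n_iter):
        Q = R + gamma * p_env @ (policy * Q).sum(axis=1)
    return Q

Q_pi = evaluate(pi)
V_pi = (pi * Q_pi).sum(axis=1)
pi_star = np.eye(R.shape[1])[Q_pi.argmax(axis=1)]   # greedy w.r.t. Q^pi

# V_k(s) = value of "pi_star for the first k steps, then pi forever".
V_k = V_pi.copy()                                   # k = 0 is just pi
for k in range(1, 30):
    V_new = (pi_star * (R + gamma * p_env @ V_k)).sum(axis=1)
    assert np.all(V_new >= V_k - 1e-10)             # improvement is monotone in k
    V_k = V_new
print("V^pi:", V_pi, "  V after 29 greedy steps:", V_k)
```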
More formally, using the shorthands \(V_t := V^\pi(s_t)\), \(Q_t := Q^\pi(s_t, a_t)\), \(R_t := R(s_t, a_t)\), \(\pi_t := \pi(a_t \given s_t)\), \(\pi_t^\star := \pi^\star(a_t \given s_t)\), and \(p_{t + 1} := p_{\text{env}}(s_{t + 1} \given s_t, a_t)\),
we can derive \eqref{eq:theorem} as
\begin{align}
V^\pi(s_t) &= V_t \\
&= \E_{\pi_t}[Q_t] & \text{(from \eqref{eq:VQ})} \\
&\leq \E_{\pi_t^\star}[Q_t] &\text{(from \eqref{eq:ineq})} \\
&= \E_{\pi_t^\star}\left[R_t + \gamma \E_{p_{t + 1} \pi_{t + 1}}[Q_{t + 1}]\right] &\text{(from \eqref{eq:Q})} \label{eq:16}\\
&\leq \E_{\pi_t^\star}\left[R_t + \gamma \E_{p_{t + 1} \pi_{t + 1}^\star}[Q_{t + 1}]\right] &\text{(apply \eqref{eq:ineq} again)} \\
&= \E_{\pi_t^\star p_{t + 1} \pi_{t + 1}^\star}\left[R_t + \gamma Q_{t + 1} \right] \label{eq:18}\\
&\,\,\,\vdots &\text{(repeat steps \eqref{eq:16}–\eqref{eq:18} on } Q_{t + 1}, Q_{t + 2}, \dotsc \text{)}\\
&\leq \E_{\pi_t^\star \prod_{i = 1}^k p_{t + i} \pi_{t + i}^\star}\left[\left(\sum_{j = 0}^{k - 1} \gamma^j R_{t + j}\right) + \gamma^k Q_{t + k}\right] \\
&\leq \E_{\pi_t^\star \prod_{i = 1}^\infty p_{t + i} \pi_{t + i}^\star}\left[\sum_{j = 0}^\infty \gamma^j R_{t + j}\right] & \text{(taking } k \to \infty\text{, where } \gamma^k Q_{t + k} \text{ vanishes for bounded rewards)} \\
&= V^{\pi^\star}(s_t) &\text{(from \eqref{eq:V})}
\end{align}
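As a sanity check of \eqref{eq:theorem}, here is a sketch on a randomly generated tabular MDP (the sizes, seed, and all quantities below are arbitrary choices, not from this note): evaluate \(\pi\), construct the greedy \(\pi^\star\) from \(Q^\pi\), evaluate \(\pi^\star\) as well, and confirm \(V^{\pi^\star}(s) \geq V^\pi(s)\) for every state \(s\).

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.95
p_env = rng.random((n_states, n_actions, n_states))
p_env /= p_env.sum(axis=-1, keepdims=True)          # random transition kernel
R = rng.random((n_states, n_actions))               # random rewards
pi = rng.random((n_states, n_actions))
pi /= pi.sum(axis=-1, keepdims=True)                # random stochastic policy

def evaluate(policy, n_iter=5000):
    """Policy evaluation by iterating the Bellman equation for Q."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        Q = R + gamma * p_env @ (policy * Q).sum(axis=1)
    return Q

Q_pi = evaluate(pi)
pi_star = np.eye(n_actions)[Q_pi.argmax(axis=1)]    # greedy w.r.t. Q^pi
Q_pi_star = evaluate(pi_star)

V_pi = (pi * Q_pi).sum(axis=1)
V_pi_star = (pi_star * Q_pi_star).sum(axis=1)
assert np.all(V_pi_star >= V_pi - 1e-8)             # eq. (theorem), for all states
print(V_pi, V_pi_star)
```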
Here is an alternative proof, which is perhaps more straightforward but doesn’t seem as intuitive to me.
We know from \eqref{eq:VQ} and \eqref{eq:ineq} that
\begin{align}
V^{\pi^\star}(s) &= \E_{\pi^\star(a \given s)}\left[Q^{\pi^\star}(s, a)\right] \\
V^\pi(s) &= \E_{\pi(a \given s)}\left[Q^{\pi}(s, a)\right] \leq \E_{\pi^\star(a \given s)}\left[Q^\pi(s, a)\right].
\end{align}
If we substitute \eqref{eq:VQ} into the Bellman equation for Q (eq. 3 in this note), we get
\begin{align}
Q^{\pi^\star}(s, a) &= R(s, a) + \gamma \E_{p_{\text{env}}(s' \given s, a)}\left[V^{\pi^\star}(s')\right], \\
Q^\pi(s, a) &= R(s, a) + \gamma \E_{p_{\text{env}}(s' \given s, a)}\left[V^{\pi}(s')\right].
\end{align}
Hence,
\begin{align}
V^{\pi^\star}(s) - V^\pi(s)
&\geq \E_{\pi^\star(a \given s)}\left[Q^{\pi^\star}(s, a) - Q^\pi(s, a)\right] \\
&= \gamma \E_{\pi^\star(a \given s)p_{\text{env}}(s' \given s, a)}\left[V^{\pi^\star}(s') - V^{\pi}(s')\right].
\end{align}
Applying this inequality \(k\) times, we get \begin{align} V^{\pi^\star}(s_0) - V^\pi(s_0) \geq \gamma^k \E_{p(s_{1:k}, a_{0:k - 1} \given s_0)}\left[V^{\pi^\star}(s_k) - V^{\pi}(s_k)\right] \label{eq:25} \end{align} where \begin{align} p(s_{1:k}, a_{0:k - 1} \given s_0) := \prod_{i = 0}^{k - 1} \pi^\star(a_i \given s_i)p_{\text{env}}(s_{i + 1} \given s_i, a_i). \end{align}
For bounded rewards, we can bound any value function between two constants: if the maximum reward is \(R_{\text{max}}\), the value function is upper bounded by \(R_{\text{max}} \sum_{t = 0}^\infty \gamma^t = R_{\text{max}} / (1 - \gamma)\), and analogously it is lower bounded by \(R_{\text{min}} / (1 - \gamma)\). Hence the expectation in \eqref{eq:25} is bounded between two constants. Taking \(k \to \infty\), the RHS of \eqref{eq:25} vanishes, so \(V^{\pi^\star}(s_0) - V^\pi(s_0) \geq 0\), and we recover \eqref{eq:theorem}.
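For concreteness, here is a sketch that checks \eqref{eq:25} numerically on a small randomly generated MDP (again, everything below is a made-up toy setup). The expectation over \(p(s_{1:k}, a_{0:k - 1} \given s_0)\) reduces to the \(k\)-th power of the Markov chain induced by \(\pi^\star\).

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 2, 0.9
p_env = rng.random((n_states, n_actions, n_states))
p_env /= p_env.sum(axis=-1, keepdims=True)             # random transition kernel
R = rng.random((n_states, n_actions))                  # random rewards
pi = np.full((n_states, n_actions), 1.0 / n_actions)   # uniform original policy

def evaluate(policy, n_iter=5000):
    """Policy evaluation by iterating the Bellman equation for Q."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        Q = R + gamma * p_env @ (policy * Q).sum(axis=1)
    return Q

Q_pi = evaluate(pi)
pi_star = np.eye(n_actions)[Q_pi.argmax(axis=1)]       # greedy w.r.t. Q^pi
diff = (pi_star * evaluate(pi_star)).sum(axis=1) - (pi * Q_pi).sum(axis=1)

# Markov chain induced by pi_star: M[s, s'] = sum_a pi_star(a|s) p_env(s'|s, a).
M = np.einsum('sa,sap->sp', pi_star, p_env)
Mk = np.eye(n_states)
for k in range(1, 30):
    Mk = Mk @ M                                        # M^k gives p(s_k | s_0)
    rhs = gamma**k * (Mk @ diff)                       # RHS of eq. (25) per s_0
    assert np.all(diff >= rhs - 1e-8)
# As k grows, the RHS shrinks like gamma^k, so in the limit diff >= 0.
```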