01 February 2025
Convergence of the Bellman operator is a key result behind many reinforcement learning (RL) algorithms such as Q-learning. Here, we prove this result. This note adapts this ai.stackexchange.com answer to the setting with state-action value functions and a general state space.
In the standard formulation of RL based on Markov decision processes, we have states \(s_t\) and actions \(a_t\) for \(t = 1\) to \(t = \infty\), a reward function \(R(s_t, a_t)\), a discount factor \(\gamma \in [0, 1)\), a dynamics model \(p_{\text{env}}(s_t \given s_{t - 1}, a_{t - 1})\), and a policy \(\pi(a_t \given s_t)\).
The state-action value function, also known as the Q-function, is the expected sum of discounted rewards assuming we start from state \(s_t\), take an action \(a_t\), and continue acting according to the policy \(\pi\) in an environment governed by the dynamics \(p_{\text{env}}\): \begin{align} Q^\pi(s_t, a_t) := \E_{p(s_{t + 1:\infty}, a_{t + 1:\infty} \given s_t, a_t)}\left[\sum_{\tau = t}^\infty \gamma^{\tau - t} R(s_\tau, a_\tau) \right] \label{eq:Q} \end{align} where \begin{align} p(s_{t + 1:\infty}, a_{t + 1:\infty} \given s_t, a_t) = \prod_{\tau = t + 1}^\infty p_{\text{env}}(s_\tau \given s_{\tau - 1}, a_{\tau - 1}) \pi(a_\tau \given s_\tau) \end{align} is the distribution over future state-action trajectories given the current state-action pair.
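To make the definition concrete, here is a minimal Monte Carlo sketch for a tiny, made-up tabular MDP: it samples truncated rollouts from \(p_{\text{env}}\) and \(\pi\) and averages the discounted returns. Everything here (array names, sizes, the random dynamics) is an illustrative assumption, not anything from the referenced answer.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma, T = 3, 2, 0.9, 200    # truncate the infinite sum at T; bias is O(gamma^T)

# A tiny made-up tabular MDP: dynamics P[s, a, s'] = p_env(s' | s, a),
# policy pi[s, a] = pi(a | s), rewards R[s, a] = R(s, a).
P = rng.random((n_s, n_a, n_s)); P /= P.sum(axis=-1, keepdims=True)
pi = rng.random((n_s, n_a)); pi /= pi.sum(axis=-1, keepdims=True)
R = rng.normal(size=(n_s, n_a))

def mc_q_estimate(s, a, n_rollouts=2000):
    # Average truncated discounted return over rollouts sampled from p_env and pi.
    total = 0.0
    for _ in range(n_rollouts):
        s_t, a_t, ret = s, a, 0.0
        for t in range(T):
            ret += gamma**t * R[s_t, a_t]
            s_t = rng.choice(n_s, p=P[s_t, a_t])   # s_{t+1} ~ p_env(. | s_t, a_t)
            a_t = rng.choice(n_a, p=pi[s_t])       # a_{t+1} ~ pi(. | s_{t+1})
        total += ret
    return total / n_rollouts

print(mc_q_estimate(0, 1))    # noisy Monte Carlo estimate of Q^pi(0, 1)
```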
We can define the state-action value function recursively via the Bellman equation: \begin{align} Q^\pi(s_t, a_t) = R(s_t, a_t) + \gamma \E_{p_{\text{env}}(s_{t + 1} \given s_{t}, a_{t}) \pi(a_{t + 1} \given s_{t + 1})}\left[Q^\pi(s_{t + 1}, a_{t + 1})\right]. \label{eq:bellman} \end{align} This can be verified by substituting \eqref{eq:Q} into the RHS of \eqref{eq:bellman} and simplifying, which recovers the expression in \eqref{eq:Q}.
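Written out, the substitution goes as follows:
\begin{align*}
R(s_t, a_t) + \gamma \E_{p_{\text{env}}(s_{t + 1} \given s_{t}, a_{t}) \pi(a_{t + 1} \given s_{t + 1})}\left[Q^\pi(s_{t + 1}, a_{t + 1})\right]
&= R(s_t, a_t) + \gamma \E_{p(s_{t + 1:\infty}, a_{t + 1:\infty} \given s_t, a_t)}\left[\sum_{\tau = t + 1}^\infty \gamma^{\tau - t - 1} R(s_\tau, a_\tau) \right] &\text{(substitute \eqref{eq:Q} for \(Q^\pi(s_{t + 1}, a_{t + 1})\) and combine the expectations)} \\
&= \E_{p(s_{t + 1:\infty}, a_{t + 1:\infty} \given s_t, a_t)}\left[\sum_{\tau = t}^\infty \gamma^{\tau - t} R(s_\tau, a_\tau) \right] &\text{(absorb \(R(s_t, a_t)\) and \(\gamma\) into the sum)} \\
&= Q^\pi(s_t, a_t).
\end{align*}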
The Bellman equation in \eqref{eq:bellman} says that \(Q^\pi\) is a fixed point of the Bellman operator \(\mathcal B^\pi\), which maps a Q-function to another Q-function: \begin{align} \mathcal B^\pi[Q](s_t, a_t) := R(s_t, a_t) + \gamma \E_{p_{\text{env}}(s_{t + 1} \given s_{t}, a_{t}) \pi(a_{t + 1} \given s_{t + 1})}\left[Q(s_{t + 1}, a_{t + 1})\right]. \label{eq:bellman_op} \end{align} That is, the output of the Bellman operator \(\mathcal B^\pi\) given an input Q-function \(Q\) is another Q-function which, when evaluated on \((s_t, a_t)\), gives the RHS of \eqref{eq:bellman_op}. The policy-specific Q-function \(Q^\pi\) is a fixed point of the Bellman operator because, by the Bellman equation \eqref{eq:bellman}, \(\mathcal B^\pi[Q^\pi] = Q^\pi\).
Convergence of the Bellman operator means that if we start from an arbitrary function over state-action pairs \(Q_1\) and repeatedly apply the Bellman operator, \(Q_{i + 1} \leftarrow \mathcal B^\pi[Q_i]\), the resulting sequence of functions converges to the policy-specific Q-function: \(\lim_{i \to \infty} Q_i = Q^\pi\).
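To make this concrete, here is a minimal sketch for a small, made-up tabular MDP: it implements \(\mathcal B^\pi\) with NumPy, computes \(Q^\pi\) exactly by solving the linear system implied by \eqref{eq:bellman}, and checks that repeated application of the operator converges to it. The setup mirrors the Monte Carlo sketch above and is again purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma = 4, 2, 0.9

# A small made-up tabular MDP: dynamics P[s, a, s'] = p_env(s' | s, a),
# policy pi[s, a] = pi(a | s), rewards R[s, a] = R(s, a).
P = rng.random((n_s, n_a, n_s)); P /= P.sum(axis=-1, keepdims=True)
pi = rng.random((n_s, n_a)); pi /= pi.sum(axis=-1, keepdims=True)
R = rng.normal(size=(n_s, n_a))

def bellman_operator(Q):
    # B^pi[Q](s, a) = R(s, a) + gamma * E_{s' ~ p_env, a' ~ pi}[Q(s', a')]
    v = (pi * Q).sum(axis=-1)      # v[s'] = E_{a' ~ pi(. | s')}[Q(s', a')]
    return R + gamma * P @ v       # shape (n_s, n_a)

# Exact Q^pi: the Bellman equation is linear in the flattened vector Q[(s, a)],
# (I - gamma * M) Q = R with M[(s, a), (s', a')] = p_env(s' | s, a) * pi(a' | s').
M = np.einsum('ijk,kl->ijkl', P, pi).reshape(n_s * n_a, n_s * n_a)
Q_pi = np.linalg.solve(np.eye(n_s * n_a) - gamma * M, R.reshape(-1)).reshape(n_s, n_a)

# Start from an arbitrary Q_1 and iterate Q_{i+1} <- B^pi[Q_i].
Q = rng.normal(size=(n_s, n_a))
for _ in range(300):
    Q = bellman_operator(Q)
print(np.abs(Q - Q_pi).max())      # close to machine precision: Q_i -> Q^pi
```

Solving the linear system directly is only feasible for small tabular problems; the point of the iteration is that it converges without ever forming the matrix \(M\).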
Proof. We rely on the contraction mapping theorem (Banach fixed-point theorem), which additionally guarantees that \(Q^\pi\) is the unique fixed point. For the contraction mapping theorem to hold, we require that \(\mathcal B^\pi\) is a contraction mapping, that is, \begin{align} \left\|\mathcal B^\pi[Q] - \mathcal B^\pi[U]\right\|_\infty \leq \alpha \|Q - U\|_\infty \label{eq:contraction} \end{align} where \(Q\) and \(U\) are bounded functions mapping a state-action pair \((s_t, a_t)\) to a real number, \(\alpha \in [0, 1)\) is a constant, and \(\|f\|_\infty = \sup_x |f(x)|\) is the L-infinity norm of a function \(f\); the space of bounded functions equipped with this norm is complete, as the theorem requires.
To show \eqref{eq:contraction}:
\begin{align}
\text{LHS}
&= \left\|\mathcal B^\pi[Q] - \mathcal B^\pi[U]\right\|_\infty \\
&= \left\|\text{(reward term)} + \text{(} Q \text{ term)} - \text{(reward term)} - \text{(} U \text{ term)}\right\|_\infty &\text{(sub in \eqref{eq:bellman_op})} \label{eq:7}\\
&= \left\| \gamma \E_{p_{\text{env}}(s_{t + 1} \given s_{t}, a_{t}) \pi(a_{t + 1} \given s_{t + 1})}\left[Q(s_{t + 1}, a_{t + 1}) - U(s_{t + 1}, a_{t + 1})\right] \right\|_\infty &\text{(cancel reward terms)} \label{eq:8}\\
&\leq \left\| \gamma \E_{p_{\text{env}}(s_{t + 1} \given s_{t}, a_{t}) \pi(a_{t + 1} \given s_{t + 1})}\left[ \| Q - U\|_\infty \right] \right\|_\infty &\text{(} |\E[X]| \leq \E[|X|] \text{ and } |f(x)| \leq \|f\|_\infty \text{ for all } x \text{)} \label{eq:9}\\
&= \left\| \gamma \| Q - U\|_\infty \right\|_\infty &\text{(expectation of a constant is that constant)} \label{eq:10}\\
&= \gamma \| Q - U\|_\infty &\text{(L-infinity norm of a non-negative constant is that constant)} \\
&= \text{RHS},
\end{align}
noting that in \eqref{eq:8}, \eqref{eq:9} and \eqref{eq:10} we abuse notation, writing \(\|f(x)\|_\infty\) to mean \(\|x \mapsto f(x)\|_\infty\) (the norm over the free arguments \((s_t, a_t)\)) for convenience.
The contraction constant \(\alpha\) is the discount factor \(\gamma\), which lies in \([0, 1)\) as required, so repeated application of the Bellman operator converges to \(Q^\pi\), with the L-infinity error shrinking by at least a factor of \(\gamma\) per application.
\(\square\)
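As a numerical sanity check of the contraction property \eqref{eq:contraction}, the following sketch evaluates both sides of the inequality on randomly drawn Q-functions for another small, made-up tabular MDP (again, all values are illustrative).

```python
import numpy as np

rng = np.random.default_rng(1)
n_s, n_a, gamma = 5, 3, 0.9

# Random toy dynamics p_env(s' | s, a), policy pi(a | s), and rewards R(s, a).
P = rng.random((n_s, n_a, n_s)); P /= P.sum(axis=-1, keepdims=True)
pi = rng.random((n_s, n_a)); pi /= pi.sum(axis=-1, keepdims=True)
R = rng.normal(size=(n_s, n_a))

def bellman_operator(Q):
    # B^pi[Q](s, a) = R(s, a) + gamma * E_{s', a'}[Q(s', a')]
    return R + gamma * P @ (pi * Q).sum(axis=-1)

# Check ||B[Q] - B[U]||_inf <= gamma * ||Q - U||_inf on random Q-functions.
for _ in range(1000):
    Q = rng.normal(size=(n_s, n_a))
    U = rng.normal(size=(n_s, n_a))
    lhs = np.abs(bellman_operator(Q) - bellman_operator(U)).max()
    rhs = gamma * np.abs(Q - U).max()
    assert lhs <= rhs + 1e-12
print("contraction inequality held on all samples")
```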