21 March 2025
Bellman optimality comprises a set of claims that underpin the value iteration algorithm, which in turn is key to understanding Q-learning algorithms. Typically, Bellman optimality is presented in terms of the state-value function V. In this note, we present and prove the claims in terms of the action-value function Q, making it easier to connect the results to Q-learning.
We use notation from previous notes, Proof of Convergence of the Bellman Operator and Policy Improvement Theorem. We draw on ideas from Chapter 3 of this book.
There exists a unique function \(Q^\star(s, a)\) that satisfies \begin{align} Q^\star(s, a) = R(s, a) + \gamma \E_{p_{\text{env}}(s' \given s, a)}\left[\max_{a'} Q^\star(s', a')\right]. \label{eq:bellman_optimality} \end{align} We call \eqref{eq:bellman_optimality} the Bellman optimality equation and \(Q^\star(s, a)\) the optimal action-value function.
The Bellman optimality operator \(\mathcal B^\star[Q]\) is given by \begin{align} \mathcal B^\star[Q](s, a) := R(s, a) + \gamma \E_{p_{\text{env}}(s' \given s, a)}\left[\max_{a'} Q(s', a')\right] \label{eq:operator} \end{align} and if we set \(Q_k \leftarrow \mathcal B^\star[Q_{k - 1}]\) for \(k = 1, 2, \dotsc\), \(Q_k\) converges to \(Q^\star\) regardless of the initial \(Q_0\).
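To make the operator and the convergence claim concrete, here is a minimal value-iteration sketch for a small tabular MDP. The arrays `R[s, a]` and `P[s, a, s']`, the function name, and the toy MDP itself are made up for illustration and are not part of this note.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9                      # toy tabular MDP (made up for illustration)
R = rng.uniform(size=(S, A))                 # R[s, a]: expected immediate reward
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']: p_env(s' | s, a)

def bellman_optimality_operator(Q):
    """B*[Q](s, a) = R(s, a) + gamma * E_{s'}[max_{a'} Q(s', a')]."""
    return R + gamma * P @ Q.max(axis=1)

# Value iteration: Q_k = B*[Q_{k-1}] converges to Q* from any Q_0.
Q = np.zeros((S, A))
for _ in range(1000):
    Q_new = bellman_optimality_operator(Q)
    if np.max(np.abs(Q_new - Q)) < 1e-12:    # sup-norm stopping criterion
        Q = Q_new
        break
    Q = Q_new

# At convergence, Q is (numerically) a fixed point of the operator,
# i.e. it satisfies the Bellman optimality equation.
print(np.max(np.abs(Q - bellman_optimality_operator(Q))))
```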
The policy \(\pi^\star(a \given s)\) given by \begin{align} \pi^\star(a \given s) = \argmax_{\pi(a \given s)} \E_{\pi(a \given s)}\left[Q^\star(s, a)\right] \label{eq:pi_star} \end{align} is an optimal policy in the sense that \begin{align} V^{\pi^\star}(s) \geq V^\pi(s) \label{eq:opt} \end{align} for any policy \(\pi\) and state \(s\). Its action-value function \(Q^{\pi^\star}\) is equal to \(Q^\star\).
As an aside, the greedy policy defined as \begin{align} \pi_g^\star(a \given s) := \delta_{a^\star(s)}(a), \text{ where } a^\star(s) := \argmax_a Q^\star(s, a) \label{eq:pi_g} \end{align} is one solution to the argmax in \eqref{eq:pi_star} but there can be more solutions. See section “What is \(\pi^\star\)” of Policy Improvement Theorem for more discussion.
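In the tabular case, extracting the greedy policy \eqref{eq:pi_g} from a \(Q^\star\) table is a row-wise argmax. A tiny sketch with a made-up table:

```python
import numpy as np

# Stand-in Q* table for 3 states and 2 actions (values made up).
Q_star = np.array([[1.0, 2.0],
                   [0.5, 0.1],
                   [3.0, 3.0]])   # tie in the last state: any mix over the argmax set is also a solution

a_star = Q_star.argmax(axis=1)    # deterministic greedy action a*(s)
print(a_star)                     # [1 0 0] -- argmax breaks the tie by picking the first maximiser
```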
The existence, uniqueness, and convergence claims above follow directly from the contraction mapping theorem, which applies provided the Bellman optimality operator is a contraction mapping.
Claim 1. The Bellman optimality operator \(\mathcal B^\star\) is a contraction mapping. That is, \begin{align} \left\|\mathcal B^\star[Q] - \mathcal B^\star[U]\right\|_\infty \leq \alpha \|Q - U\|_\infty \label{eq:contraction} \end{align} where \(Q\) and \(U\) are functions mapping a state-action pair \((s, a)\) to a real number, \(\alpha \in [0, 1)\) is a constant, and \(\|f\|_\infty = \max_x |f(x)|\) is the L-infinity norm of a function \(f(x)\).
To prove this, we’ll make use of the following identity.
Claim 2. For any two real-valued functions \(f(x), g(x)\), \(\max_x f(x) - \max_x g(x) \leq \max_x (f(x) - g(x))\).
Proof of Claim 2. Let \(x^\star = \argmax_x f(x)\). Then
\begin{align}
\text{LHS} &= f(x^\star) - \max_x g(x) \\
&\leq f(x^\star) - g(x^\star) \\
&\leq \text{RHS}.
\end{align}
\(\square\)

Swapping the roles of \(f\) and \(g\) gives the same bound in the other direction, so we also have \(\left|\max_x f(x) - \max_x g(x)\right| \leq \max_x \left|f(x) - g(x)\right|\), which is the form we use inside the \(\infty\)-norm below.
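As a quick numerical sanity check of Claim 2 and the symmetric bound (the arrays below are arbitrary stand-ins for \(f\) and \(g\)):

```python
import numpy as np

rng = np.random.default_rng(1)
f, g = rng.normal(size=100), rng.normal(size=100)     # arbitrary "functions" on 100 points

assert f.max() - g.max() <= (f - g).max()             # Claim 2
assert abs(f.max() - g.max()) <= np.abs(f - g).max()  # symmetric consequence used below
```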
Proof of Claim 1.
To show \eqref{eq:contraction},
\begin{align}
\text{LHS}
&= \left\|\text{(reward term)} + \text{(} Q \text{ term)} - \text{(reward term)} - \text{(} U \text{ term)}\right\|_\infty &\text{(substitute \eqref{eq:operator})} \\
&= \left\| \gamma \E_{p_{\text{env}}(s' \given s, a) }\left[\left(\max_{a'} Q(s', a')\right) - \left(\max_{a'} U(s', a')\right)\right] \right\|_\infty &\text{(cancel reward terms)} \label{eq:start}\\
&\leq \left\| \gamma \E_{p_{\text{env}}(s' \given s, a) }\left[\max_{a'} \left|Q(s', a') - U(s', a')\right|\right] \right\|_\infty &\text{(}|\E[X]| \leq \E[|X|]\text{, then Claim 2 and its symmetric version)} \\
&\leq \left\| \gamma \E_{p_{\text{env}}(s' \given s, a) }\left[\max_{s''} \max_{a'} \left|Q(s'', a') - U(s'', a')\right|\right] \right\|_\infty &\text{(}\max_x f(x) \geq f(x') \text{ for all } x'\text{)} \\
&= \left\| \gamma \E_{p_{\text{env}}(s' \given s, a) }\left[\max_{(s'', a')} \left|Q(s'', a') - U(s'', a')\right|\right] \right\|_\infty &\text{(}\max_x \max_y f(x, y) = \max_{(x, y)} f(x, y)\text{)} \\
&= \left\| \gamma \E_{p_{\text{env}}(s' \given s, a) }\left[ \| Q - U\|_\infty \right] \right\|_\infty &\text{(definition of the norm)} \\
&= \left\| \gamma \| Q - U\|_\infty \right\|_\infty &\text{(expectation of a constant function is constant)} \label{eq:end}\\
&= \gamma \| Q - U\|_\infty &\text{(norm of a constant function is constant)} \\
&= \text{RHS},
\end{align}
noting that in \eqref{eq:start}–\eqref{eq:end}, we slightly abuse notation and write \(\|f(x)\|_\infty\) to mean \(\|x \mapsto f(x)\|_\infty\), that is, the norm of the function rather than of its value.
The constant \(\alpha\) is the discount factor \(\gamma\), which lies in \([0, 1)\) as required and governs how quickly the Bellman updates converge to \(Q^\star\).
\(\square\)
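Claim 1 can also be checked numerically: one application of \(\mathcal B^\star\) should shrink the sup-distance between any two action-value functions by at least a factor of \(\gamma\). A minimal sketch, reusing the same made-up tabular MDP as above (the array names and setup are ours, not part of the note):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9                      # same toy MDP setup as before (made up)
R = rng.uniform(size=(S, A))                 # R[s, a]
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, s']

def B(Q):                                    # Bellman optimality operator
    return R + gamma * P @ Q.max(axis=1)

for _ in range(100):                         # random pairs (Q, U)
    Q, U = rng.normal(size=(S, A)), rng.normal(size=(S, A))
    lhs = np.max(np.abs(B(Q) - B(U)))        # ||B*[Q] - B*[U]||_inf
    rhs = gamma * np.max(np.abs(Q - U))      # gamma * ||Q - U||_inf
    assert lhs <= rhs + 1e-12
```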
Claim 3. The policy \(\pi^\star\) is optimal.
Proof of Claim 3.
Pick an arbitrary policy \(\pi\) and state \(s\).
We will show \eqref{eq:opt} following an argument similar to one in the “Alternative Proof” section of the Policy Improvement Theorem note.
Recall that
\begin{align}
V^{\pi^\star}(s) &= \E_{\pi^\star(a \given s)} \left[Q^{\pi^\star}(s, a)\right] \\
&= \max_a Q^{\pi^\star}(s, a) &\text{(by \eqref{eq:pi_star} and Claim 4 below, whose proof does not rely on this claim)}\\
&\geq \E_{\pi(a \given s)}\left[Q^{\pi^\star}(s, a)\right], \\
V^\pi(s) &= \E_{\pi(a \given s)}\left[Q^\pi(s, a)\right]
\end{align}
and
\begin{align}
Q^{\pi^\star}(s, a) &= R(s, a) + \gamma \E_{p_{\text{env}}(s' \given s, a)}\left[V^{\pi^\star}(s')\right], \\
Q^\pi(s, a) &= R(s, a) + \gamma \E_{p_{\text{env}}(s' \given s, a)}\left[V^{\pi}(s')\right].
\end{align}
Hence
\begin{align}
V^{\pi^\star}(s) - V^\pi(s)
&\geq \E_{\pi(a \given s)}\left[Q^{\pi^\star}(s, a) - Q^\pi(s, a)\right] \\
&= \gamma \E_{\pi(a \given s)p_{\text{env}}(s' \given s, a)}\left[V^{\pi^\star}(s') - V^{\pi}(s')\right].
\end{align}
Applying this inequality \(k\) times, we get \begin{align} V^{\pi^\star}(s_0) - V^\pi(s_0) \geq \gamma^k \E_{p(s_{1:k}, a_{0:k - 1} \given s_0)}\left[V^{\pi^\star}(s_k) - V^{\pi}(s_k)\right] \label{eq:25} \end{align} where \begin{align} p(s_{1:k}, a_{0:k - 1} \given s_0) := \prod_{i = 0}^{k - 1} \pi(a_i \given s_i)p_{\text{env}}(s_{i + 1} \given s_i, a_i). \end{align}
For finite rewards, any value function is bounded: if rewards lie between \(R_{\text{min}}\) and \(R_{\text{max}}\), the value function is at most \(R_{\text{max}} \sum_{t = 0}^\infty \gamma^t = R_{\text{max}} / (1 - \gamma)\) and at least \(R_{\text{min}} / (1 - \gamma)\). Hence the expectation in \eqref{eq:25} is bounded between two constants. Taking \(k \to \infty\), the factor \(\gamma^k\) drives the RHS of \eqref{eq:25} to zero, so \(V^{\pi^\star}(s_0) - V^\pi(s_0) \geq 0\), which is \eqref{eq:opt}. \(\square\)
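As a numerical illustration of Claim 3, the sketch below evaluates an arbitrary policy \(\pi\) and the greedy policy exactly (by solving the linear Bellman system) on the same made-up tabular MDP and checks \eqref{eq:opt} state by state; the helper `evaluate` is a name we made up for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9                      # same toy MDP setup as before (made up)
R = rng.uniform(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))

def evaluate(pi):
    """Exact V^pi: solve the linear Bellman system (I - gamma * P_pi) V = R_pi."""
    R_pi = (pi * R).sum(axis=1)              # E_{pi(a|s)}[R(s, a)]
    P_pi = np.einsum('sa,sat->st', pi, P)    # state-to-state transitions under pi
    return np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)

# Q* via value iteration, then the greedy policy pi*.
Q = np.zeros((S, A))
for _ in range(2000):
    Q = R + gamma * P @ Q.max(axis=1)
pi_star = np.eye(A)[Q.argmax(axis=1)]        # one-hot rows: pi*(a|s)

pi = rng.dirichlet(np.ones(A), size=S)       # an arbitrary stochastic policy
assert np.all(evaluate(pi_star) >= evaluate(pi) - 1e-8)   # V^{pi*}(s) >= V^pi(s) for every s
```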
Claim 4. The optimal policy’s action-value function coincides with the optimal action-value function, \(Q^{\pi^\star}(s, a) = Q^\star(s, a)\).
Proof of Claim 4. Recall from section “What is \(\pi^\star\)” of Policy Improvement Theorem that any optimal policy \(\pi^\star\) satisfies \begin{align} \E_{\pi^\star(a \given s)}\left[Q^\star(s, a)\right] = \max_a Q^\star(s, a). \label{eq:max} \end{align}
We also know that
\begin{align}
Q^{\pi^\star}(s, a)
&= R(s, a) + \gamma \E_{p_{\text{env}}(s' \given s, a)}\left[\E_{\pi^\star(a' \given s')}\left[Q^{\pi^\star}(s', a')\right]\right] &\text{(Bellman equation for Q)},\\
Q^\star(s, a) &= R(s, a) + \gamma \E_{p_{\text{env}}(s' \given s, a)}\left[\max_{a'} Q^{\star}(s', a')\right] &\text{(from \eqref{eq:bellman_optimality})}\\
&= R(s, a) + \gamma \E_{p_{\text{env}}(s' \given s, a)}\left[\E_{\pi^\star(a' \given s')}\left[Q^{\star}(s', a')\right]\right] &\text{(from \eqref{eq:max})}.
\end{align}
Hence \begin{align} Q^{\pi^\star}(s, a) - Q^\star(s, a) &= \gamma \E_{p_{\text{env}}(s' \given s, a)}\left[\E_{\pi^\star(a' \given s')}\left[Q^{\pi^\star}(s', a') - Q^{\star}(s', a')\right]\right]. \label{eq:recursion} \end{align}
Following an argument similar to the one in the proof of Claim 3, we apply \eqref{eq:recursion} to itself \(k\) times; since \(Q^{\pi^\star} - Q^\star\) is bounded for finite rewards, the factor \(\gamma^k\) forces the difference to zero as \(k \to \infty\). Hence \(Q^{\pi^\star}(s, a) - Q^\star(s, a) = 0\), i.e., \(Q^{\pi^\star}(s, a) = Q^\star(s, a)\). \(\square\)
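Claim 4 can likewise be verified numerically: evaluate the greedy policy exactly and compare its action-value function with the \(Q^\star\) obtained from value iteration, on the same made-up MDP as before:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9                      # same toy MDP setup as before (made up)
R = rng.uniform(size=(S, A))
P = rng.dirichlet(np.ones(S), size=(S, A))

# Q* by value iteration.
Q_star = np.zeros((S, A))
for _ in range(2000):
    Q_star = R + gamma * P @ Q_star.max(axis=1)

# Exact evaluation of the greedy policy pi*: first V^{pi*}, then
# Q^{pi*}(s, a) = R(s, a) + gamma * E_{s'}[V^{pi*}(s')].
pi_star = np.eye(A)[Q_star.argmax(axis=1)]
R_pi = (pi_star * R).sum(axis=1)
P_pi = np.einsum('sa,sat->st', pi_star, P)
V_pi_star = np.linalg.solve(np.eye(S) - gamma * P_pi, R_pi)
Q_pi_star = R + gamma * P @ V_pi_star

print(np.max(np.abs(Q_pi_star - Q_star)))    # ~0: Q^{pi*} and Q* coincide
```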