# $$n$$-step Relationships for $$Q$$ Function and Value Function

16 January 2018

## Markov Decision Process (MDP)

An MDP with states $$S_{0:\infty}$$ and actions $$A_{0:\infty}$$ is specified by

• an initial density $$p(s_0)$$,
• a transition density $$p(s_t \given s_{t - 1}, a_{t - 1}), (t > 0)$$, and
• a policy $$\pi(a_t \given s_t), (t \geq 0)$$.

Denote the reward at time $$t$$, $$(t \geq 0)$$, by $$r_t = R(s_t, a_t)$$, where $$R$$ is a reward function.

The MDP trace density is \begin{align} p(s_{0:\infty}, a_{0:\infty}) &= p(s_0) \pi(a_0 \given s_0) \prod_{t = 1}^\infty \left( p(s_t \given s_{t - 1}, a_{t - 1}) \pi(a_t \given s_t) \right). \end{align}
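The trace density can be sampled sequentially: draw $$s_0$$, then alternate between sampling an action from the policy and a next state from the transition density. A minimal sketch, assuming a toy two-state, two-action MDP whose probabilities are made up for illustration:

```python
import random

# A minimal sketch of sampling a finite prefix of the trace density,
# assuming a toy two-state, two-action MDP (all probabilities made up).
P0 = [0.5, 0.5]                          # initial density p(s_0)
P = {                                    # transition density p(s_t | s_{t-1}, a_{t-1})
    (0, 0): [0.9, 0.1], (0, 1): [0.2, 0.8],
    (1, 0): [0.7, 0.3], (1, 1): [0.4, 0.6],
}

def pi(s):                               # a fixed uniform policy pi(a_t | s_t)
    return [0.5, 0.5]

def sample(probs):
    return random.choices(range(len(probs)), weights=probs)[0]

def sample_trace(T):
    """Sample the prefix (s_0, a_0, ..., s_{T-1}, a_{T-1})."""
    s = sample(P0)
    trace = []
    for _ in range(T):
        a = sample(pi(s))
        trace.append((s, a))
        s = sample(P[(s, a)])
    return trace
```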

## $$Q$$ Function

Let the $$Q$$ function be defined as \begin{align} Q(s_t, a_t) &= \E[r_t + \gamma r_{t + 1} + \cdots \given S_t = s_t, A_t = a_t], \end{align} where $$\gamma \in [0, 1)$$ is a discount factor.

For $$n \geq 1$$, we can derive (e.g. using mathematical induction) the following relationship: \begin{align} Q(s_t, a_t) = & r_t + \\
& \left(\sum_{i = 1}^{n - 1} \gamma^i \int r_{t + i} \left( \prod_{j = t + 1}^{t + i} p(s_j \given s_{j - 1}, a_{j - 1}) \pi(a_j \given s_j) \right) \,\mathrm ds_{t + 1:t + i} \,\mathrm da_{t + 1:t + i}\right) + \\
& \gamma^n \int Q(s_{t + n}, a_{t + n}) \left( \prod_{j = t + 1}^{t + n} p(s_j \given s_{j - 1}, a_{j - 1}) \pi(a_j \given s_j) \right) \,\mathrm ds_{t + 1:t + n} \,\mathrm da_{t + 1:t + n}. \end{align}
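The relationship can be checked numerically on a small finite MDP: solve for the policy's $$Q$$ function by fixed-point iteration, then evaluate the right-hand side exactly by propagating the joint distribution of $$(s_j, a_j)$$ forward in time. A sketch, assuming a hypothetical two-state, two-action MDP (all numbers are illustrative):

```python
import numpy as np

# Numerical check of the n-step relationship on a toy MDP
# (two states, two actions; all numbers are illustrative).
gamma = 0.9
R = np.array([[1.0, 0.0],                # R[s, a] = R(s, a)
              [0.0, 2.0]])
P = np.zeros((2, 2, 2))                  # P[s, a, s'] = p(s' | s, a)
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.7, 0.3]; P[1, 1] = [0.4, 0.6]
Pi = np.full((2, 2), 0.5)                # Pi[s, a] = pi(a | s), uniform

# Solve the policy's Q function by fixed-point iteration on
# Q(s,a) = R(s,a) + gamma * sum_{s'} p(s'|s,a) sum_{a'} pi(a'|s') Q(s',a').
Q = np.zeros((2, 2))
for _ in range(1000):
    V = (Pi * Q).sum(axis=1)             # V(s) = sum_a pi(a|s) Q(s, a)
    Q = R + gamma * P @ V

def n_step_rhs(s, a, n):
    """Right-hand side of the n-step relationship, evaluated exactly by
    propagating the joint distribution of (s_j, a_j) forward in time."""
    dist = np.zeros((2, 2)); dist[s, a] = 1.0        # delta at (s_t, a_t)
    total = 0.0
    for i in range(n):
        total += gamma**i * (dist * R).sum()         # gamma^i E[r_{t+i}]
        s_next = np.einsum('sa,sap->p', dist, P)     # marginal of s_{j+1}
        dist = s_next[:, None] * Pi                  # joint of (s_{j+1}, a_{j+1})
    return total + gamma**n * (dist * Q).sum()       # bootstrap term
```

For any $$(s, a)$$ and $$n \geq 1$$, `n_step_rhs(s, a, n)` agrees with `Q[s, a]` up to the fixed-point iteration tolerance.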

Under the optimal policy $$\pi(a \given s) = \delta_{\argmax_{a'} Q(s, a')}(a)$$, we can rewrite the relationship as: \begin{align} Q(s_t, a_t) = & r_t + \\
& \left(\sum_{i = 1}^{n - 1} \gamma^i \int r_{t + i} \left( \prod_{j = t + 1}^{t + i} p(s_j \given s_{j - 1}, a_{j - 1}) \pi(a_j \given s_j) \right) \,\mathrm ds_{t + 1:t + i} \,\mathrm da_{t + 1:t + i}\right) + \\
& \gamma^n \int \max_a Q(s_{t + n}, a) \left( \prod_{j = t + 1}^{t + n} p(s_j \given s_{j - 1}, a_{j - 1}) \right) \left( \prod_{j = t + 1}^{t + n - 1} \pi(a_j \given s_j) \right) \,\mathrm ds_{t + 1:t + n} \,\mathrm da_{t + 1:t + n - 1}. \end{align}

Hence, given a sampled trace $$(s_t, a_t, \dotsc, s_{t + n})$$, we obtain the sample-based $$n$$-step target \begin{align} Q(s_t, a_t) \approx \left(\sum_{i = 0}^{n - 1} \gamma^i r_{t + i}\right) + \gamma^n \max_a Q(s_{t + n}, a), \end{align} with equality holding in expectation over traces.
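A minimal sketch of computing this target from a stored trace, assuming the rewards $$r_{t:t + n - 1}$$ are given as a list and the row $$Q(s_{t + n}, \cdot)$$ of a tabular $$Q$$ estimate is available (the names below are illustrative, not from the note):

```python
# Sample-based n-step target for Q. `rewards` holds [r_t, ..., r_{t+n-1}]
# and `q_bootstrap_row` holds Q(s_{t+n}, a) for each action a
# (both names are illustrative).
def n_step_q_target(rewards, q_bootstrap_row, gamma):
    n = len(rewards)
    discounted = sum(gamma**i * r for i, r in enumerate(rewards))
    return discounted + gamma**n * max(q_bootstrap_row)
```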

## Value Function

Similarly, let the value function be defined as \begin{align} V(s_t) &= \E[r_t + \gamma r_{t + 1} + \cdots \given S_t = s_t]. \end{align}

For $$n \geq 1$$, we can derive (e.g. using mathematical induction) the following relationship: \begin{align} V(s_t) = & \int R(s_t, a_t) \pi(a_t \given s_t) \,\mathrm da_t + \\
& \sum_{i = 1}^{n - 1} \gamma^i \int R(s_{t + i}, a_{t + i}) \left( \prod_{j = t}^{t + i} \pi(a_j \given s_j) \right) \left( \prod_{j = t + 1}^{t + i} p(s_j \given s_{j - 1}, a_{j - 1}) \right) \,\mathrm da_{t:t + i} \,\mathrm ds_{t + 1:t + i} + \\
& \gamma^n \int V(s_{t + n}) \left( \prod_{j = t}^{t + n - 1} \pi(a_j \given s_j) \right) \left( \prod_{j = t + 1}^{t + n} p(s_j \given s_{j - 1}, a_{j - 1}) \right) \,\mathrm da_{t:t + n - 1} \,\mathrm ds_{t + 1:t + n}. \end{align}

Hence, given a sampled trace $$(s_t, a_t, \dotsc, s_{t + n})$$, we obtain the sample-based $$n$$-step target \begin{align} V(s_t) \approx \left(\sum_{i = 0}^{n - 1} \gamma^i r_{t + i}\right) + \gamma^n V(s_{t + n}), \end{align} with equality holding in expectation over traces.
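The corresponding target for the value function differs from the $$Q$$ case only in the bootstrap term, which uses $$V(s_{t + n})$$ directly. A minimal sketch (names illustrative, not from the note):

```python
# Sample-based n-step target for V. `rewards` holds [r_t, ..., r_{t+n-1}]
# and `v_bootstrap` holds V(s_{t+n}) (names are illustrative).
def n_step_v_target(rewards, v_bootstrap, gamma):
    n = len(rewards)
    discounted = sum(gamma**i * r for i, r in enumerate(rewards))
    return discounted + gamma**n * v_bootstrap
```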
