Tuan Anh Le

\(n\)-step Relationships for \(Q\) Function and Value Function

16 January 2018

Markov Decision Process (MDP)

An MDP with states \(S_{0:\infty}\) and actions \(A_{0:\infty}\) is specified by an initial state density \(p(s_0)\), a transition density \(p(s_t \given s_{t - 1}, a_{t - 1})\), and a policy \(\pi(a_t \given s_t)\).

Denote the reward at time \(t \geq 0\) by \(r_t = R(s_t, a_t)\), where \(R\) is the reward function.

The MDP trace density is \begin{align} p(s_{0:\infty}, a_{0:\infty}) &= p(s_0) \pi(a_0 \given s_0) \prod_{t = 1}^\infty \left( p(s_t \given s_{t - 1}, a_{t - 1}) \pi(a_t \given s_t) \right). \end{align}
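The factorization above can be sketched as ancestral sampling: draw \(s_0\), then alternate between sampling an action from the policy and a next state from the transition density. The sketch below uses a toy two-state, two-action MDP; the functions `sample_s0`, `sample_s`, `sample_a`, and `R` are hypothetical stand-ins for \(p(s_0)\), \(p(s_t \given s_{t-1}, a_{t-1})\), \(\pi(a_t \given s_t)\), and the reward function.

```python
import random

def sample_s0():
    return random.choice([0, 1])          # draw s_0 ~ p(s_0)

def sample_s(s, a):
    return (s + a) % 2                    # draw s_t ~ p(s_t | s_{t-1}, a_{t-1})

def sample_a(s):
    return random.choice([0, 1])          # draw a_t ~ pi(a_t | s_t)

def R(s, a):
    return float(s == a)                  # reward r_t = R(s_t, a_t)

def sample_trace(T):
    """Sample (s_0, a_0, r_0, ..., s_T, a_T, r_T) under the trace density."""
    s = sample_s0()
    a = sample_a(s)
    trace = [(s, a, R(s, a))]
    for _ in range(T):
        s = sample_s(s, a)                # transition, then act, then reward
        a = sample_a(s)
        trace.append((s, a, R(s, a)))
    return trace
```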

\(Q\) Function

Let the \(Q\) function be defined as \begin{align} Q(s_t, a_t) &= \E[r_t + \gamma r_{t + 1} + \cdots \given S_t = s_t, A_t = a_t], \end{align} where \(\gamma \in [0, 1)\) is the discount factor.

For \(n \geq 1\), we can derive (e.g. using mathematical induction) the following relationship: \begin{align} Q(s_t, a_t) = & r_t + \\
& \left(\sum_{i = 1}^{n - 1} \gamma^i \int r_{t + i} \left( \prod_{j = t + 1}^{t + i} p(s_j \given s_{j - 1}, a_{j - 1}) \pi(a_j \given s_j) \right) \,\mathrm ds_{t + 1:t + i} \,\mathrm da_{t + 1:t + i}\right) + \\
& \gamma^n \int Q(s_{t + n}, a_{t + n}) \left( \prod_{j = t + 1}^{t + n} p(s_j \given s_{j - 1}, a_{j - 1}) \pi(a_j \given s_j) \right) \,\mathrm ds_{t + 1:t + n} \,\mathrm da_{t + 1:t + n}. \end{align}

Under the optimal policy \(\pi(a \given s) = \delta_{\argmax_{a'} Q(s, a')}(a)\), we can rewrite the relationship as: \begin{align} Q(s_t, a_t) = & r_t + \\
& \left(\sum_{i = 1}^{n - 1} \gamma^i \int r_{t + i} \left( \prod_{j = t + 1}^{t + i} p(s_j \given s_{j - 1}, a_{j - 1}) \pi(a_j \given s_j) \right) \,\mathrm ds_{t + 1:t + i} \,\mathrm da_{t + 1:t + i}\right) + \\
& \gamma^n \int \max_a Q(s_{t + n}, a) \left( \prod_{j = t + 1}^{t + n} p(s_j \given s_{j - 1}, a_{j - 1}) \right) \left( \prod_{j = t + 1}^{t + n - 1} \pi(a_j \given s_j) \right) \,\mathrm ds_{t + 1:t + n} \,\mathrm da_{t + 1:t + n - 1}. \end{align}

Hence, given a sampled trace \((s_t, a_t, \dotsc, s_{t + n}, a_{t + n})\), we obtain the one-sample estimate \begin{align} Q(s_t, a_t) \approx \left(\sum_{i = 0}^{n - 1} \gamma^i r_{t + i}\right) + \gamma^n \max_a Q(s_{t + n}, a). \end{align}
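The n-step target above can be sketched directly from a trace. Below, `rewards` holds \(r_t, \dotsc, r_{t+n-1}\), `s_next` is \(s_{t+n}\), and `Q` is a hypothetical tabular \(Q\) function over actions \(\{0, 1\}\); all names are illustrative, not from the note.

```python
def n_step_q_target(rewards, s_next, Q, gamma):
    """Compute sum_{i=0}^{n-1} gamma^i r_{t+i} + gamma^n max_a Q(s_{t+n}, a)."""
    n = len(rewards)
    discounted = sum(gamma**i * r for i, r in enumerate(rewards))
    bootstrap = max(Q[(s_next, a)] for a in [0, 1])   # max_a Q(s_{t+n}, a)
    return discounted + gamma**n * bootstrap

# Example with n = 2 and gamma = 0.9 on a toy Q table:
Q = {(0, 0): 1.0, (0, 1): 2.0}
target = n_step_q_target([1.0, 0.5], 0, Q, 0.9)
# 1.0 + 0.9 * 0.5 + 0.9**2 * 2.0 = 3.07
```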

Value Function

Let the value function be defined as \begin{align} V(s_t) &= \E[r_t + \gamma r_{t + 1} + \cdots \given S_t = s_t]. \end{align}

For \(n \geq 1\), we can derive (e.g. using mathematical induction) the following relationship: \begin{align} V(s_t) = & \int R(s_t, a_t) \pi(a_t \given s_t) \,\mathrm da_t + \\
& \sum_{i = 1}^{n - 1} \gamma^i \int R(s_{t + i}, a_{t + i}) \left( \prod_{j = t}^{t + i} \pi(a_j \given s_j) \right) \left( \prod_{j = t + 1}^{t + i} p(s_j \given s_{j - 1}, a_{j - 1}) \right) \,\mathrm da_{t:t + i} \,\mathrm ds_{t + 1:t + i} + \\
& \gamma^n \int V(s_{t + n}) \left( \prod_{j = t}^{t + n} \pi(a_j \given s_j) \right) \left( \prod_{j = t + 1}^{t + n} p(s_j \given s_{j - 1}, a_{j - 1}) \right) \,\mathrm da_{t:t + n} \,\mathrm ds_{t + 1:t + n}. \end{align}

Hence, given a sampled trace \((s_t, a_t, \dotsc, s_{t + n})\), we obtain the one-sample estimate \begin{align} V(s_t) \approx \left(\sum_{i = 0}^{n - 1} \gamma^i r_{t + i}\right) + \gamma^n V(s_{t + n}). \end{align}
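The analogous n-step target for the value function bootstraps with \(V(s_{t+n})\) instead of a maximum over actions. As before, `rewards` holds \(r_t, \dotsc, r_{t+n-1}\) and `V` is a hypothetical tabular value function; the names are illustrative.

```python
def n_step_v_target(rewards, s_next, V, gamma):
    """Compute sum_{i=0}^{n-1} gamma^i r_{t+i} + gamma^n V(s_{t+n})."""
    n = len(rewards)
    discounted = sum(gamma**i * r for i, r in enumerate(rewards))
    return discounted + gamma**n * V[s_next]

# Example with n = 2 and gamma = 0.9 on a toy V table:
V = {0: 1.5, 1: 0.0}
target = n_step_v_target([1.0, 0.5], 0, V, 0.9)
# 1.0 + 0.9 * 0.5 + 0.9**2 * 1.5 = 2.665
```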
