Tuan Anh Le

$n$-step Relationships for the $Q$ Function and Value Function

16 January 2018

Markov Decision Process (MDP)

An MDP with states $s_t$ and actions $a_t$ is specified by an initial state density $p(s_0)$, a transition density $p(s_t \given s_{t - 1}, a_{t - 1})$, a policy $\pi(a_t \given s_t)$, a reward function $R$, and a discount factor $\gamma$.

Denote the reward at time $t$, $R(s_t, a_t)$, by $r_t$, where $R$ is a reward function.

The MDP trace density is \begin{align} p(s_{0:\infty}, a_{0:\infty}) &= p(s_0) \pi(a_0 \given s_0) \prod_{t = 1}^\infty \left( p(s_t \given s_{t - 1}, a_{t - 1}) \pi(a_t \given s_t) \right). \end{align}
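As a concrete illustration, here is a minimal Python sketch of sampling a finite prefix of such a trace; `sample_s0`, `sample_transition`, and `sample_policy` are hypothetical samplers standing in for $p(s_0)$, $p(s_t \given s_{t - 1}, a_{t - 1})$, and $\pi(a_t \given s_t)$:

```python
# Minimal sketch: sample a finite prefix s_0, a_0, ..., s_{T-1}, a_{T-1} of a trace from
# p(s_0) pi(a_0 | s_0) prod_t p(s_t | s_{t-1}, a_{t-1}) pi(a_t | s_t).
# sample_s0, sample_transition and sample_policy are hypothetical samplers
# for p(s_0), p(s_t | s_{t-1}, a_{t-1}) and pi(a_t | s_t).

def sample_trace(sample_s0, sample_transition, sample_policy, T):
    states, actions = [], []
    s = sample_s0()                          # s_0 ~ p(s_0)
    for t in range(T):
        a = sample_policy(s)                 # a_t ~ pi(a_t | s_t)
        states.append(s)
        actions.append(a)
        if t < T - 1:
            s = sample_transition(s, a)      # s_{t+1} ~ p(s_{t+1} | s_t, a_t)
    return states, actions
```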

$Q$ Function

Let the $Q$ function be defined as \begin{align} Q(s_t, a_t) &= \E[r_t + \gamma r_{t + 1} + \cdots \given S_t = s_t, A_t = a_t]. \end{align}

For $n \geq 1$, we can derive (e.g. using mathematical induction) the following relationship: \begin{align} Q(s_t, a_t) = & r_t + \\ & \left(\sum_{i = 1}^{n - 1} \gamma^i \int r_{t + i} \left( \prod_{j = t + 1}^{t + i} p(s_j \given s_{j - 1}, a_{j - 1}) \pi(a_j \given s_j) \right) \,\mathrm ds_{t + 1:t + i} \,\mathrm da_{t + 1:t + i}\right) + \\ & \gamma^n \int Q(s_{t + n}, a_{t + n}) \left( \prod_{j = t + 1}^{t + n} p(s_j \given s_{j - 1}, a_{j - 1}) \pi(a_j \given s_j) \right) \,\mathrm ds_{t + 1:t + n} \,\mathrm da_{t + 1:t + n}. \end{align}
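For example, with $n = 1$ the middle sum is empty and the relationship reduces to the familiar one-step Bellman equation: \begin{align} Q(s_t, a_t) &= r_t + \gamma \int Q(s_{t + 1}, a_{t + 1}) \, p(s_{t + 1} \given s_t, a_t) \, \pi(a_{t + 1} \given s_{t + 1}) \,\mathrm ds_{t + 1} \,\mathrm da_{t + 1}. \end{align}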

When $\pi$ is the optimal policy, we can rewrite the relationship as: \begin{align} Q(s_t, a_t) = & r_t + \\ & \left(\sum_{i = 1}^{n - 1} \gamma^i \int r_{t + i} \left( \prod_{j = t + 1}^{t + i} p(s_j \given s_{j - 1}, a_{j - 1}) \pi(a_j \given s_j) \right) \,\mathrm ds_{t + 1:t + i} \,\mathrm da_{t + 1:t + i}\right) + \\ & \gamma^n \int \max_a Q(s_{t + n}, a) \left( \prod_{j = t + 1}^{t + n} p(s_j \given s_{j - 1}, a_{j - 1}) \right) \left( \prod_{j = t + 1}^{t + n - 1} \pi(a_j \given s_j) \right) \,\mathrm ds_{t + 1:t + n} \,\mathrm da_{t + 1:t + n - 1}. \end{align}
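Again taking $n = 1$, this becomes the one-step Bellman optimality equation: \begin{align} Q(s_t, a_t) &= r_t + \gamma \int \max_a Q(s_{t + 1}, a) \, p(s_{t + 1} \given s_t, a_t) \,\mathrm ds_{t + 1}. \end{align}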

Hence, for a trace $(s_t, a_t, r_t, s_{t + 1}, a_{t + 1}, r_{t + 1}, \ldots, s_{t + n})$ sampled from the MDP, we can write \begin{align} Q(s_t, a_t) = \left(\sum_{i = 0}^{n - 1} \gamma^i r_{t + i}\right) + \gamma^n \max_a Q(s_{t + n}, a). \end{align}
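A minimal Python sketch of computing this $n$-step target from such a trace, assuming the rewards $r_t, \ldots, r_{t + n - 1}$ and the values $Q(s_{t + n}, a)$ for every action are already available (the argument names are hypothetical):

```python
# Minimal sketch of the n-step target for Q(s_t, a_t):
#   sum_{i=0}^{n-1} gamma^i r_{t+i} + gamma^n max_a Q(s_{t+n}, a).
# `rewards` holds r_t, ..., r_{t+n-1}; `q_values_at_s_tpn` holds
# Q(s_{t+n}, a) for every action a; both are hypothetical inputs.

def n_step_q_target(rewards, q_values_at_s_tpn, gamma):
    n = len(rewards)
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted + gamma ** n * max(q_values_at_s_tpn)
```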

Value Function

Let the value function be defined as \begin{align} V(s_t) &= \E[r_t + \gamma r_{t + 1} + \cdots \given S_t = s_t]. \end{align}

For $n \geq 1$, we can derive (e.g. using mathematical induction) the following relationship: \begin{align} V(s_t) = & \int R(s_t, a_t) \pi(a_t \given s_t) \,\mathrm da_t + \\ & \left(\sum_{i = 1}^{n - 1} \gamma^i \int R(s_{t + i}, a_{t + i}) \left( \prod_{j = t}^{t + i} \pi(a_j \given s_j) \right) \left( \prod_{j = t + 1}^{t + i} p(s_j \given s_{j - 1}, a_{j - 1}) \right) \,\mathrm da_{t:t + i} \,\mathrm ds_{t + 1:t + i}\right) + \\ & \gamma^n \int V(s_{t + n}) \left( \prod_{j = t}^{t + n} \pi(a_j \given s_j) \right) \left( \prod_{j = t + 1}^{t + n} p(s_j \given s_{j - 1}, a_{j - 1}) \right) \,\mathrm da_{t:t + n} \,\mathrm ds_{t + 1:t + n}. \end{align}
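With $n = 1$ (and integrating out $a_{t + 1}$, on which $V(s_{t + 1})$ does not depend), this reduces to the one-step Bellman equation for the value function: \begin{align} V(s_t) &= \int R(s_t, a_t) \pi(a_t \given s_t) \,\mathrm da_t + \gamma \int V(s_{t + 1}) \, \pi(a_t \given s_t) \, p(s_{t + 1} \given s_t, a_t) \,\mathrm da_t \,\mathrm ds_{t + 1}. \end{align}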

Hence, for a trace $(s_t, a_t, r_t, s_{t + 1}, a_{t + 1}, r_{t + 1}, \ldots, s_{t + n})$ sampled from the MDP, we can write \begin{align} V(s_t) = \left(\sum_{i = 0}^{n - 1} \gamma^i r_{t + i}\right) + \gamma^n V(s_{t + n}). \end{align}
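The corresponding $n$-step target for the value function, again as a minimal sketch with hypothetical inputs (the rewards $r_t, \ldots, r_{t + n - 1}$ and an estimate of $V(s_{t + n})$):

```python
# Minimal sketch of the n-step target for V(s_t):
#   sum_{i=0}^{n-1} gamma^i r_{t+i} + gamma^n V(s_{t+n}).
# `rewards` holds r_t, ..., r_{t+n-1}; `v_s_tpn` is an estimate of
# V(s_{t+n}); both are hypothetical inputs.

def n_step_v_target(rewards, v_s_tpn, gamma):
    n = len(rewards)
    discounted = sum(gamma ** i * r for i, r in enumerate(rewards))
    return discounted + gamma ** n * v_s_tpn
```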
