# Planning by dynamic programming for reinforcement learning

08 November 2016

These are notes of lecture 3 of David Silver’s course on reinforcement learning.

Dynamic programming (DP) methods for reinforcement learning (RL) are pretty useless but provide a foundation for understanding the state-of-the-art methods.

We use this notation and vector representation of functions of finite domain (e.g. $$v_{\pi}: \mathcal S \to \mathbb R$$ is represented by a $$\left\vert\mathcal S\right\vert$$-dimensional vector).

## Policy evaluation

Goal: Given an MDP $$(\mathcal S, \mathcal A, \mathcal P, \mathcal R, \gamma)$$ and a policy $$\pi$$, find $$v_{\pi}$$.

Algorithm:

1. Initialize $$v_0$$.
2. For $$k = 1, 2, \dotsc$$: \begin{align} v_k \leftarrow \mathcal R^{\pi} + \gamma \mathcal P^{\pi} v_{k - 1}. \end{align}

This comes from the Bellman expectation equation. Can be proved to converge to $$v_\pi$$.

## Policy iteration

Goal: Given an MDP $$(\mathcal S, \mathcal A, \mathcal P, \mathcal R, \gamma)$$, find the optimal policy $$\pi_*$$.

Algorithm:

1. Initialize $$v_0, \pi_0$$.
2. For $$k = 1, 2, \dotsc$$: \begin{align} v_k &\leftarrow \mathrm{IterativePolicyEvaluation}(\pi_{k - 1}, v_{k - 1}) = v_{\pi_{k - 1}} \\
\pi_k &\leftarrow \mathrm{Greedy}(v_k). \end{align}

$$\mathrm{IterativePolicyEvaluation}$$ is an algorithm (like the one from the previous section) to obtain the optimal value function given the MDP and the policy from previous iteration $$\pi_{k - 1}$$. $$\mathrm{Greedy}(v_k)$$ means \begin{align} \pi_k(a | s) = \begin{cases} 1 & \text{if } a = \argmax_{a’} \left\{ q_k(s, a’) = \mathcal R_s^{a’} + \gamma \sum_{s’ \in \mathcal S} \mathcal P_{ss’}^{a’} v_k(s’) \right\} \\
0 & \text{otherwise.}\\
\end{cases} \end{align}

Proven to converge to $$\pi_*$$ via Contraction Mapping Theorem.

## Policy improvement

Goal: Prove why value function monotonically increases in policy iteration.

\begin{align} v_{k - 1}(s) = \cdots \leq \cdots = v_k(s) \end{align}

## Value iteration

Goal: Given an MDP $$(\mathcal S, \mathcal A, \mathcal P, \mathcal R, \gamma)$$, find the optimal policy $$\pi_*$$.

Algorithm:

1. Initialize $$v_0$$.
2. For $$k = 1, 2, \dotsc$$: \begin{align} v_k \leftarrow \max_{a \in \mathcal A} \left(\mathcal R^a + \gamma \mathcal P^a v_{k - 1}\right) \end{align}

## Summary

It’s proven that Policy iteration will give us optimal policy $$\pi_*$$. Doesn’t work in general state spaces. Also, we are only finding one out of many possible optimal policies given this reward function.

[back]