08 November 2016
These are my notes on lecture 3 of David Silver’s course on reinforcement learning.
Dynamic programming (DP) methods for reinforcement learning (RL) are pretty useless in practice, since they need a full model of the MDP and sweep over the entire state space, but they provide a foundation for understanding the state-of-the-art methods.
We use this notation, and we represent functions with a finite domain as vectors (e.g. \(v_{\pi}: \mathcal S \to \mathbb R\) is represented by a \(\left\vert\mathcal S\right\vert\)-dimensional vector).
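As a concrete illustration (mine, not from the lecture), this is how those finite-domain objects can be stored as NumPy arrays; only the shapes matter here:

```python
import numpy as np

# A made-up 3-state, 2-action MDP, just to fix the shapes.
n_states, n_actions = 3, 2

# v_pi : S -> R becomes a vector of length |S|.
v = np.zeros(n_states)

# R_s^a becomes an |S| x |A| matrix; P_ss'^a becomes an |S| x |A| x |S| array,
# where P[s, a] is a probability distribution over next states.
R = np.random.rand(n_states, n_actions)
P = np.random.rand(n_states, n_actions, n_states)
P /= P.sum(axis=-1, keepdims=True)  # normalise so each P[s, a] sums to 1

# pi(a|s) becomes an |S| x |A| stochastic matrix.
pi = np.full((n_states, n_actions), 1.0 / n_actions)
```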
Goal: Given an MDP \((\mathcal S, \mathcal A, \mathcal P, \mathcal R, \gamma)\) and a policy \(\pi\), find \(v_{\pi}\).
Algorithm: start from an arbitrary \(v_0\) (e.g. all zeros) and repeatedly apply the Bellman expectation backup to every state,
\begin{align}
v_{k + 1}(s) = \sum_{a \in \mathcal A} \pi(a | s) \left( \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a v_k(s') \right).
\end{align}
This update comes from the Bellman expectation equation. It can be proved to converge to \(v_\pi\) (the backup is a \(\gamma\)-contraction).
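A minimal NumPy sketch of this loop (my own illustration; the function name, array layout and stopping tolerance are my choices, not from the lecture):

```python
import numpy as np

def iterative_policy_evaluation(P, R, gamma, pi, tol=1e-8):
    """Evaluate a fixed policy pi by repeated Bellman expectation backups.

    P:  |S| x |A| x |S| transition probabilities P[s, a, s']
    R:  |S| x |A| expected immediate rewards R[s, a]
    pi: |S| x |A| policy probabilities pi[s, a] = pi(a | s)
    """
    v = np.zeros(P.shape[0])
    while True:
        # q[s, a] = R_s^a + gamma * sum_s' P_ss'^a v(s')
        q = R + gamma * P @ v
        # v_{k+1}(s) = sum_a pi(a|s) q(s, a)
        v_new = (pi * q).sum(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new
```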
Goal: Given an MDP \((\mathcal S, \mathcal A, \mathcal P, \mathcal R, \gamma)\), find the optimal policy \(\pi_*\).
Algorithm: start from an arbitrary policy \(\pi_0\) and alternate, for \(k = 1, 2, \ldots\),
\begin{align}
v_k &= \mathrm{IterativePolicyEvaluation}(\mathrm{MDP}, \pi_{k - 1}), \\
\pi_k &= \mathrm{Greedy}(v_k).
\end{align}
\(\mathrm{IterativePolicyEvaluation}\) is an algorithm (like the one from the previous section) that, given the MDP and the policy from the previous iteration \(\pi_{k - 1}\), computes its value function, i.e. \(v_k = v_{\pi_{k - 1}}\). \(\mathrm{Greedy}(v_k)\) means
\begin{align}
\pi_k(a | s) =
\begin{cases}
1 & \text{if } a = \argmax_{a'} q_k(s, a'), \text{ where } q_k(s, a') = \mathcal R_s^{a'} + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^{a'} v_k(s') \\
0 & \text{otherwise.}
\end{cases}
\end{align}
Policy iteration is proven to converge to \(\pi_*\) via the Contraction Mapping Theorem.
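A minimal sketch of the policy iteration loop (my own illustration; `evaluate` stands for a policy evaluation routine such as the `iterative_policy_evaluation` sketch above, and the greedy policy is stored as a one-hot \(\left\vert\mathcal S\right\vert \times \left\vert\mathcal A\right\vert\) matrix):

```python
import numpy as np

def greedy(P, R, gamma, v):
    """Greedy(v): deterministic policy picking argmax_a q(s, a) in every state."""
    q = R + gamma * P @ v  # q[s, a] = R_s^a + gamma * sum_s' P[s, a, s'] v(s')
    pi = np.zeros_like(R)
    pi[np.arange(R.shape[0]), q.argmax(axis=1)] = 1.0  # ties: first maximiser
    return pi

def policy_iteration(P, R, gamma, evaluate):
    """Alternate policy evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # arbitrary pi_0
    while True:
        v = evaluate(P, R, gamma, pi)    # v_k = v_{pi_{k-1}}
        new_pi = greedy(P, R, gamma, v)  # pi_k = Greedy(v_k)
        if np.array_equal(new_pi, pi):
            return new_pi, v
        pi = new_pi
```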
Goal: Prove that the value function monotonically increases in policy iteration.
\begin{align} v_{k - 1}(s) = \cdots \leq \cdots = v_k(s) \end{align}
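A sketch of the standard policy-improvement argument behind this chain (with \(\pi\) the current policy and \(\pi' = \mathrm{Greedy}(v_\pi)\) the improved one):
\begin{align}
v_\pi(s) &= \sum_{a \in \mathcal A} \pi(a | s) \, q_\pi(s, a) \leq \max_{a \in \mathcal A} q_\pi(s, a) = q_\pi(s, \pi'(s)) \\
&= \mathcal R_s^{\pi'(s)} + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^{\pi'(s)} v_\pi(s') \\
&\leq \mathcal R_s^{\pi'(s)} + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^{\pi'(s)} q_\pi(s', \pi'(s')) \leq \cdots \leq v_{\pi'}(s).
\end{align}
Each step replaces one more action of \(\pi\) by \(\pi'\); in the limit the right-hand side is the return from following \(\pi'\) forever, so \(v_\pi(s) \leq v_{\pi'}(s)\) for every state \(s\), which is the monotone improvement above.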
Goal: Given an MDP \((\mathcal S, \mathcal A, \mathcal P, \mathcal R, \gamma)\), find the optimal policy \(\pi_*\).
Algorithm:
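Presumably this is value iteration: iterate the Bellman optimality backup \(v_{k + 1}(s) = \max_{a \in \mathcal A} \left( \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a v_k(s') \right)\) until convergence, then act greedily with respect to the result. A minimal NumPy sketch of that reading (my own, not from the notes):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Iterate the Bellman optimality backup, then act greedily w.r.t. the result.

    P: |S| x |A| x |S| transitions, R: |S| x |A| expected rewards.
    """
    n_states, n_actions = R.shape
    v = np.zeros(n_states)
    while True:
        q = R + gamma * P @ v  # q[s, a] = R_s^a + gamma * sum_s' P[s, a, s'] v_k(s')
        v_new = q.max(axis=1)  # Bellman optimality backup
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    # Read off one deterministic optimal policy (ties broken by the first maximiser).
    q = R + gamma * P @ v_new
    pi = np.zeros((n_states, n_actions))
    pi[np.arange(n_states), q.argmax(axis=1)] = 1.0
    return pi, v_new
```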
Policy iteration is proven to give us an optimal policy \(\pi_*\). However, these DP methods don’t work for general (e.g. very large or continuous) state spaces, since they need the full model and sweep over every state. Also, we are only finding one of the possibly many optimal policies for this reward function.