24 September 2016
These are notes on lecture 2 of David Silver’s course on reinforcement learning.
The goal is to set up the framework that is commonly used in the current reinforcement learning literature.
Structure:
- Markov reward processes
- Markov decision processes → Bellman Expectation Equations
- Optimal policies → Bellman Optimality Equations
We define the Markov reward process (MRP) framework and derive a recursive (Bellman) equation for its value function.
\(\mathcal S\) | \(\dots\) | finite set of states |
\(\mathcal P\) | \(\dots\) | state transition matrix |
\(\mathcal R\) | \(\dots\) | reward function |
\(\gamma\) | \(\dots\) | discount factor |
States \(\mathcal S = \{1, \dotsc, n\}\).
State transition matrix:
\begin{align}
\mathcal P_{ss'} &= \P(S_{t + 1} = s' \given S_t = s) \\
&= \P(S_{t + 1} = s' \given S_t = s, S_{t - 1} = s_{t - 1}, \dotsc, S_1 = s_1) & \text{(Markov property)} \\
\mathcal P &=
\begin{bmatrix}
\mathcal P_{11} & \cdots & \mathcal P_{1n} \\
\vdots & \ddots & \vdots \\
\mathcal P_{n1} & \cdots & \mathcal P_{nn}
\end{bmatrix}.
\end{align}
Reward function:
\begin{align}
\mathcal R(s) &= \E[R_{t + 1} \given S_t = s] \\
&=: \mathcal R_s, \label{eq:notes/bellman/reward}
\end{align}
where \(R_{t + 1}\) is a random variable.
Discount factor \(\gamma \in [0, 1]\).
Return \begin{align} G_t &= \underbrace{\sum_{k = 0}^\infty \gamma^k R_{t + k + 1}}_{\text{random variable}}. \end{align}
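As a quick numerical illustration of the definition, a (truncated) return can be computed directly from a sampled reward sequence; the rewards and discount below are made-up numbers, not anything from the lecture.

```python
import numpy as np

# Hypothetical sampled rewards R_{t+1}, R_{t+2}, ... from one episode (made-up numbers).
rewards = np.array([1.0, 0.0, -2.0, 10.0])
gamma = 0.9

# G_t = sum_k gamma^k R_{t+k+1}, truncated at the end of the episode.
G_t = np.sum(gamma ** np.arange(len(rewards)) * rewards)
print(G_t)  # 1.0 + 0.9 * 0.0 + 0.81 * (-2.0) + 0.729 * 10.0 = 6.67
```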
Value function
\begin{align}
v(s) &= \E[G_t \given S_t = s] \\
&= \E[R_{t + 1} + \gamma R_{t + 2} + \cdots \given S_t = s] \\
&= \E[R_{t + 1} \given S_t = s] + \gamma \E[R_{t + 2} + \gamma R_{t + 3} + \cdots \given S_t = s] & \text{(linearity of expectations)} \\
&= \mathcal R_s + \gamma \E[G_{t + 1} \given S_t = s] & \text{(from reward function in \eqref{eq:notes/bellman/reward})}\\
&= \mathcal R_s + \gamma \int g_{t + 1} p(G_{t + 1} = g_{t + 1} \given S_t = s) \,\mathrm dg_{t + 1} & \text{(denote } p \text{ as the density of the corresponding r.v.)} \\
&= \mathcal R_s + \gamma \int g_{t + 1} \left(\int p(G_{t + 1} = g_{t + 1} \given S_{t + 1} = s', S_t = s) p(S_{t + 1} = s' \given S_t = s) \,\mathrm d s'\right) \,\mathrm dg_{t + 1} & \text{(condition on } S_{t + 1}\text{)} \\
&= \mathcal R_s + \gamma \int g_{t + 1} \left(\int p(G_{t + 1} = g_{t + 1} \given S_{t + 1} = s') \mathcal P_{ss'} \,\mathrm ds'\right) \,\mathrm dg_{t + 1} & \text{(Markov property)} \\
&= \mathcal R_s + \gamma \int \left(\int g_{t + 1} p(G_{t + 1} = g_{t + 1} \given S_{t + 1} = s') \,\mathrm dg_{t + 1}\right) \mathcal P_{ss'} \,\mathrm ds' & \text{(swap order of integration)} \\
&= \mathcal R_s + \gamma \int \E[G_{t + 1} \given S_{t + 1} = s'] \mathcal P_{ss'} \,\mathrm ds' \\
&= \mathcal R_s + \gamma \int v(s') \mathcal P_{ss'} \,\mathrm ds' \\
&= \mathcal R_s + \gamma \sum_{s' \in \mathcal S} v(s') \mathcal P_{ss'}. & \text{(} \mathcal S \text{ is finite, so the integral is a sum)}
\end{align}
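Stacking the last line over all states gives the matrix form \(v = \mathcal R + \gamma \mathcal P v\), so for \(\gamma < 1\) the MRP value function can be computed directly as \(v = (I - \gamma \mathcal P)^{-1} \mathcal R\). A minimal NumPy sketch on a made-up 3-state MRP (the transition matrix and rewards are arbitrary illustrative numbers):

```python
import numpy as np

# Hypothetical 3-state MRP: each row of P sums to 1, R[s] is the expected reward R_s.
P = np.array([[0.5, 0.5, 0.0],
              [0.1, 0.0, 0.9],
              [0.0, 0.2, 0.8]])
R = np.array([1.0, -1.0, 2.0])
gamma = 0.9

# Bellman equation in matrix form: v = R + gamma * P v  <=>  (I - gamma * P) v = R.
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)  # value of each state
```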
We define the Markov decision process (MDP) framework and derive recursive equations for its value functions.
\(\mathcal S\) | \(\dots\) | finite set of states |
\(\mathcal A\) | \(\dots\) | finite set of actions |
\(\mathcal P\) | \(\dots\) | state transition matrix |
\(\mathcal R\) | \(\dots\) | reward function |
\(\gamma\) | \(\dots\) | discount factor |
Policy: \begin{align} \pi(a \given s) = \P(A_t = a \given S_t = s) \end{align}
State transition matrix: \begin{align} \mathcal P_{ss'}^a = \P(S_{t + 1} = s' \given S_t = s, A_t = a). \end{align}
Also define
\begin{align}
\mathcal P_{ss'}^{\pi} &:= \P(S_{t + 1} = s' \given S_t = s) \\
&= \sum_{a \in \mathcal A} \P(S_{t + 1} = s' \given S_t = s, A_t = a) \P(A_t = a \given S_t = s) \\
&= \sum_{a \in \mathcal A} \mathcal P_{ss'}^a \pi(a \given s)
\end{align}
and
\begin{align}
\mathcal P^{\pi} &:=
\begin{bmatrix}
\mathcal P_{11}^{\pi} & \cdots & \mathcal P_{1n}^{\pi} \\
\vdots & \ddots & \vdots \\
\mathcal P_{n1}^{\pi} & \cdots & \mathcal P_{nn}^{\pi}
\end{bmatrix}.
\end{align}
Reward function
\begin{align}
\mathcal R(s, a) &= \E[R_{t + 1} \given S_t = s, A_t = a] \\
&=: \mathcal R_s^a.
\end{align}
Also define
\begin{align}
\mathcal R_s^\pi &:= \E[R_{t + 1} \given S_t = s] \\
&= \int r_{t + 1} p(R_{t + 1} = r_{t + 1} \given S_t = s) \,\mathrm dr_{t + 1} \\
&= \int r_{t + 1} \int p(R_{t + 1} = r_{t + 1} \given S_t = s, A_t = a) p(A_t = a \given S_t = s) \,\mathrm da \,\mathrm dr_{t + 1} \\
&= \int \left(\int r_{t + 1} p(R_{t + 1} = r_{t + 1} \given S_t = s, A_t = a) \,\mathrm dr_{t + 1}\right) p(A_t = a \given S_t = s) \,\mathrm da \\
&= \int \E[R_{t + 1} \given S_t = s, A_t = a] \pi(a \given s) \,\mathrm da \\
&= \sum_{a \in \mathcal A} \mathcal R_s^a \pi(a \given s)
\end{align}
and
\begin{align}
\mathcal R^{\pi} &:= [\mathcal R_1^{\pi} \cdots \mathcal R_n^{\pi}]^T.
\end{align}
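Both averaging steps are mechanical. A small sketch, assuming the MDP is stored as made-up arrays P[a, s, s'] and R[a, s] and the policy as pi[s, a] (these array conventions are my own, not from the lecture):

```python
import numpy as np

n_states, n_actions = 3, 2

# Hypothetical MDP: transition tensor P[a, s, s'] and reward table R[a, s].
rng = np.random.default_rng(0)
P = rng.random((n_actions, n_states, n_states))
P /= P.sum(axis=2, keepdims=True)           # each row P[a, s, :] sums to 1
R = rng.normal(size=(n_actions, n_states))

# A uniformly random policy pi[s, a] = pi(a | s).
pi = np.full((n_states, n_actions), 1.0 / n_actions)

# P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}   and   R^pi_s = sum_a pi(a|s) R^a_s.
P_pi = np.einsum('sa,ast->st', pi, P)
R_pi = np.einsum('sa,as->s', pi, R)
```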
State-value function of an MDP \begin{align} v_\pi(s) &:= \E[G_t \given S_t = s]. \end{align}
Action-value function of an MDP \begin{align} q_\pi(s, a) &:= \E[G_t \given S_t = s, A_t = a]. \end{align}
We have the following Bellman Expectation Equations:
\begin{align}
v_\pi(s) &= \mathcal R_s^\pi + \gamma \sum_{s' \in \mathcal S} v_\pi(s') \mathcal P_{ss'}^\pi \\
v_\pi(s) &= \sum_{a \in \mathcal A} \pi(a \given s) \left(\mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} v_\pi(s') \mathcal P_{ss'}^a\right) \\
v_\pi(s) &= \sum_{a \in \mathcal A} \pi(a \given s) q_\pi(s, a) \\
q_\pi(s, a) &= \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a \sum_{a' \in \mathcal A} \pi(a' \given s') q_\pi(s', a') \\
q_\pi(s, a) &= \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a v_\pi(s')
\end{align}
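The first of these equations is again linear in \(v_\pi\), so exact policy evaluation reduces to a linear solve just as in the MRP case, and \(q_\pi\) then follows from the last equation. A sketch under the same (assumed) array conventions as above:

```python
import numpy as np

def evaluate_policy(P, R, pi, gamma=0.9):
    """Exact policy evaluation via the Bellman expectation equation.

    P  : (A, S, S) array, P[a, s, s'] = P(S_{t+1}=s' | S_t=s, A_t=a)
    R  : (A, S) array,    R[a, s]     = E[R_{t+1} | S_t=s, A_t=a]
    pi : (S, A) array,    pi[s, a]    = pi(a | s)
    """
    n_states = P.shape[1]
    P_pi = np.einsum('sa,ast->st', pi, P)   # P^pi
    R_pi = np.einsum('sa,as->s', pi, R)     # R^pi
    # v_pi = R^pi + gamma * P^pi v_pi  =>  solve the linear system.
    v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)
    # q_pi(s, a) = R^a_s + gamma * sum_{s'} P^a_{ss'} v_pi(s'), stored as q_pi[a, s].
    q_pi = R + gamma * P @ v_pi
    return v_pi, q_pi
```

With the arrays from the previous sketch, `evaluate_policy(P, R, pi)` returns \(v_\pi\) and \(q_\pi\) for the uniformly random policy.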
We formalize what it means to find an optimal policy.
Optimal value function: \begin{align} v_*(s) &= \max_\pi v_\pi(s). \end{align}
Optimal action-value function: \begin{align} q_*(s, a) = \max_\pi q_\pi(s, a). \end{align}
Optimal policy theorem:
Define a partial order \(\geq\) on policies so that \(\pi \geq \pi'\) if \begin{align} v_\pi(s) \geq v_{\pi'}(s), \forall s \in \mathcal S. \end{align} For any MDP, there exists an optimal policy \(\pi_\ast\) with \(\pi_\ast \geq \pi\) for all policies \(\pi\); moreover, all optimal policies achieve the optimal value function, \(v_{\pi_\ast}(s) = v_\ast(s)\), and the optimal action-value function, \(q_{\pi_\ast}(s, a) = q_\ast(s, a)\).
One example of an optimal policy is the deterministic greedy policy \begin{align} \pi_\ast(a \given s) &= \delta\left(a - \mathrm{argmax}_{a' \in \mathcal A} q_\ast(s, a')\right). \end{align}
This gives rise to Bellman Optimality Equations:
\begin{align}
v_\ast(s) &= \max_{a \in \mathcal A} q_\ast(s, a) \\
v_\ast(s) &= \max_{a \in \mathcal A} \left(\mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a v_\ast(s')\right) \\
q_\ast(s, a) &= \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a v_\ast(s') \\
q_\ast(s, a) &= \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a \max_{a' \in \mathcal A} q_\ast(s', a').
\end{align}
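Because of the \(\max\), these equations are non-linear and have no closed-form solution; value iteration is one standard way to solve them numerically. A rough sketch, again under the assumed P[a, s, s'] and R[a, s] conventions from above (not code from the lecture):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Iterate the Bellman optimality backup v <- max_a (R^a + gamma * P^a v)."""
    v = np.zeros(P.shape[1])
    while True:
        q = R + gamma * P @ v        # q[a, s] = R^a_s + gamma * sum_{s'} P^a_{ss'} v(s')
        v_new = q.max(axis=0)        # v_*(s)  = max_a q_*(s, a)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    greedy_policy = q.argmax(axis=0) # one deterministic optimal action per state
    return v_new, q, greedy_policy
```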