24 September 2016
These are notes on lecture 2 of David Silver’s course on reinforcement learning.
The goal is to set up the framework commonly used in the current reinforcement learning literature.
Structure:
Markov reward processes
Markov decision processes → Bellman Expectation Equations
Optimal policies → Bellman Optimality Equations
We define the Markov reward process (MRP) framework and derive a recursive equation for its value function.
| \(\mathcal S\) | \(\dots\) | finite set of states | 
| \(\mathcal P\) | \(\dots\) | state transition matrix | 
| \(\mathcal R\) | \(\dots\) | reward function | 
| \(\gamma\) | \(\dots\) | discount factor | 
States \(\mathcal S = \{1, \dotsc, n\}\).
State transition matrix:
\begin{align}
    \mathcal P_{ss'} &= \P(S_{t + 1} = s' \given S_t = s) \\
    &= \P(S_{t + 1} = s' \given S_t = s, S_{t - 1} = s_{t - 1}, \dotsc, S_1 = s_1) & \text{(Markov property)} \\
    \mathcal P &=
        \begin{bmatrix}
            \mathcal P_{11} & \cdots & \mathcal P_{1n} \\
            \vdots & \ddots & \vdots \\
            \mathcal P_{n1} & \cdots & \mathcal P_{nn}
        \end{bmatrix}.
\end{align}
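A minimal numpy sketch of such a matrix, with made-up numbers for a hypothetical 3-state MRP; the only structural requirement is that each row is a probability distribution over successor states:

```python
import numpy as np

# Hypothetical 3-state transition matrix (all numbers are made up).
# Row s holds the distribution P(S_{t+1} = s' | S_t = s), so each row sums to 1.
P = np.array([
    [0.9, 0.1, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],  # state 2 is absorbing
])
assert np.allclose(P.sum(axis=1), 1.0)

# Sampling one transition from state 0.
rng = np.random.default_rng(0)
next_state = rng.choice(3, p=P[0])
print(next_state)
```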
Reward function:
\begin{align}
    \mathcal R(s) &= \E[R_{t + 1} \given S_t = s] \\
    &=: \mathcal R_s, \label{eq:notes/bellman/reward}
\end{align}
where \(R_{t + 1}\) is a random variable.
Discount factor \(\gamma \in [0, 1]\).
Return \begin{align} G_t &= \underbrace{\sum_{k = 0}^\infty \gamma^k R_{t + k + 1}}_{\text{random variable}}. \end{align}
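For a single sampled episode with finitely many nonzero rewards, the return is just a finite discounted sum; a minimal sketch with made-up rewards and an assumed \(\gamma = 0.9\):

```python
import numpy as np

gamma = 0.9  # assumed discount factor

# Rewards R_{t+1}, R_{t+2}, ... observed along one (made-up) finite episode;
# the infinite tail of the sum is zero once the episode terminates.
rewards = np.array([1.0, 0.0, 2.0, -1.0])

# G_t = sum_k gamma^k R_{t+k+1}
G_t = np.sum(gamma ** np.arange(len(rewards)) * rewards)
print(G_t)  # 1.0 + 0.0 + 1.62 - 0.729 = 1.891
```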
Value function
\begin{align}
    v(s) &= \E[G_t \given S_t = s] \\
    &= \E[R_{t + 1} + \gamma R_{t + 2} + \cdots \given S_t = s] \\
    &= \E[R_{t + 1} \given S_t = s] + \gamma \E[R_{t + 2} + \gamma R_{t + 3} + \cdots \given S_t = s] & \text{(linearity of expectations)} \\
    &= \mathcal R_s + \gamma \E[G_{t + 1} \given S_t = s] & \text{(from reward function in \eqref{eq:notes/bellman/reward})}\\
    &= \mathcal R_s + \gamma \int g_{t + 1} p(G_{t + 1} = g_{t + 1} \given S_t = s) \,\mathrm dg_{t + 1} & \text{(denote } p \text{ as the density of the corresponding r.v.)} \\
    &= \mathcal R_s + \gamma \int g_{t + 1} \left(\int p(G_{t + 1} = g_{t + 1} \given S_{t + 1} = s', S_t = s) p(S_{t + 1} = s' \given S_t = s) \,\mathrm ds'\right) \,\mathrm dg_{t + 1} \\
    &= \mathcal R_s + \gamma \int g_{t + 1} \left(\int p(G_{t + 1} = g_{t + 1} \given S_{t + 1} = s') \mathcal P_{ss'} \,\mathrm ds'\right) \,\mathrm dg_{t + 1} & \text{(Markov property)} \\
    &= \mathcal R_s + \gamma \int \left(\int g_{t + 1} p(G_{t + 1} = g_{t + 1} \given S_{t + 1} = s') \,\mathrm dg_{t + 1}\right) \mathcal P_{ss'} \,\mathrm ds' \\
    &= \mathcal R_s + \gamma \int \E[G_{t + 1} \given S_{t + 1} = s'] \mathcal P_{ss'} \,\mathrm ds' \\
    &= \mathcal R_s + \gamma \int v(s') \mathcal P_{ss'} \,\mathrm ds' \\
    &= \mathcal R_s + \gamma \sum_{s' \in \mathcal S} v(s') \mathcal P_{ss'}. & \text{(}\mathcal S \text{ is finite)}
\end{align}
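In matrix notation the last line reads \(v = \mathcal R + \gamma \mathcal P v\), so for \(\gamma < 1\) the value function of a finite MRP can be computed directly by solving \((I - \gamma \mathcal P) v = \mathcal R\). A sketch reusing the hypothetical 3-state transition matrix from the earlier snippet, with a made-up reward vector:

```python
import numpy as np

gamma = 0.9
# The hypothetical 3-state MRP from above, plus a made-up reward vector R_s.
P = np.array([[0.9, 0.1, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, 2.0, 0.0])

# Solve (I - gamma * P) v = R rather than forming the inverse explicitly.
v = np.linalg.solve(np.eye(3) - gamma * P, R)

# The solution satisfies the recursion v(s) = R_s + gamma * sum_{s'} v(s') P_{ss'}.
assert np.allclose(v, R + gamma * P @ v)
print(v)
```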
We define the Markov decision process (MDP) framework and derive recursive equations (the Bellman Expectation Equations) for its value functions.
| \(\mathcal S\) | \(\dots\) | finite set of states | 
| \(\mathcal A\) | \(\dots\) | finite set of actions | 
| \(\mathcal P\) | \(\dots\) | state transition matrix | 
| \(\mathcal R\) | \(\dots\) | reward function | 
| \(\gamma\) | \(\dots\) | discount factor | 
Policy: \begin{align} \pi(a \given s) = \P(A_t = a \given S_t = s) \end{align}
State transition matrix: \begin{align} \mathcal P_{ss'}^a = \P(S_{t + 1} = s' \given S_t = s, A_t = a). \end{align}
Also define
\begin{align}
    \mathcal P_{ss'}^{\pi} &:= \P(S_{t + 1} = s' \given S_t = s) \\
    &= \sum_{a \in \mathcal A} \P(S_{t + 1} = s' \given S_t = s, A_t = a) \P(A_t = a \given S_t = s) \\
    &= \sum_{a \in \mathcal A} \mathcal P_{ss'}^a \pi(a \given s)
\end{align}
and
\begin{align}
    \mathcal P^{\pi} &:=
        \begin{bmatrix}
            \mathcal P_{11}^{\pi} & \cdots & \mathcal P_{1n}^{\pi} \\
            \vdots & \ddots & \vdots \\
            \mathcal P_{n1}^{\pi} & \cdots & \mathcal P_{nn}^{\pi}
        \end{bmatrix}.
\end{align}
Reward function
\begin{align}
    \mathcal R(s, a) &= \E[R_{t + 1} \given S_t = s, A_t = a] \\
    &=: \mathcal R_s^a.
\end{align}
Also define
\begin{align}
    \mathcal R_s^\pi &:= \E[R_{t + 1} \given S_t = s] \\
    &= \int r_{t + 1} p(R_{t + 1} = r_{t + 1} \given S_t = s) \,\mathrm dr_{t + 1} \\
    &= \int r_{t + 1} \int p(R_{t + 1} = r_{t + 1} \given S_t = s, A_t = a) p(A_t = a \given S_t = s) \,\mathrm da \,\mathrm dr_{t + 1} \\
    &= \int \left(\int r_{t + 1} p(R_{t + 1} = r_{t + 1} \given S_t = s, A_t = a) \,\mathrm dr_{t + 1}\right) p(A_t = a \given S_t = s) \,\mathrm da \\
    &= \int \E[R_{t + 1} \given S_t = s, A_t = a] \pi(a \given s) \,\mathrm da \\
    &= \sum_{a \in \mathcal A} \mathcal R_s^a \pi(a \given s)
\end{align}
and
\begin{align}
    \mathcal R^{\pi} &:= [\mathcal R_1^{\pi} \cdots \mathcal R_n^{\pi}]^T.
\end{align}
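Both \(\mathcal P^\pi\) and \(\mathcal R^\pi\) are just contractions of the action index against the policy. A minimal numpy sketch with a made-up 3-state, 2-action MDP and a made-up stochastic policy:

```python
import numpy as np

n_states, n_actions = 3, 2

# Hypothetical MDP with made-up numbers:
# P[a, s, s2] = P(S_{t+1} = s2 | S_t = s, A_t = a),  R[s, a] = R_s^a.
P = np.array([[[0.8, 0.2, 0.0],
               [0.0, 0.6, 0.4],
               [0.0, 0.0, 1.0]],
              [[0.1, 0.9, 0.0],
               [0.5, 0.0, 0.5],
               [0.0, 0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [2.0, 0.5],
              [0.0, 0.0]])
assert P.shape == (n_actions, n_states, n_states)

# A made-up stochastic policy pi[s, a] = pi(a | s); rows sum to 1.
pi = np.array([[0.5, 0.5],
               [0.2, 0.8],
               [1.0, 0.0]])

# Marginalize over actions:
#   P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'},   R^pi_s = sum_a pi(a|s) R^a_s.
P_pi = np.einsum('sa,ast->st', pi, P)
R_pi = (pi * R).sum(axis=1)

assert np.allclose(P_pi.sum(axis=1), 1.0)
```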
State-value function of an MDP \begin{align} v_\pi(s) &:= \E[G_t \given S_t = s]. \end{align}
Action-value function of an MDP \begin{align} q_\pi(s, a) &:= \E[G_t \given S_t = s, A_t = a]. \end{align}
We have the following Bellman Expectation Equations:
\begin{align}
    v_\pi(s) &= \mathcal R_s^\pi + \gamma \sum_{s' \in \mathcal S} v_\pi(s') \mathcal P_{ss'}^\pi \\
    v_\pi(s) &= \sum_{a \in \mathcal A} \pi(a \given s) \left(\mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} v_\pi(s') \mathcal P_{ss'}^a\right) \\
    v_\pi(s) &= \sum_{a \in \mathcal A} \pi(a \given s) q_\pi(s, a) \\
    q_\pi(s, a) &= \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a \sum_{a' \in \mathcal A} \pi(a' \given s') q_\pi(s', a') \\
    q_\pi(s, a) &= \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a v_\pi(s')
\end{align}
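The first equation says that \(v_\pi\) is the fixed point of the affine map \(v \mapsto \mathcal R^\pi + \gamma \mathcal P^\pi v\), so one simple way to compute it is to iterate that map (iterative policy evaluation). A sketch continuing from the arrays `P`, `R`, `pi`, `P_pi`, `R_pi`, `n_states` of the previous snippet, with an assumed \(\gamma = 0.9\):

```python
gamma = 0.9

# Iterate the Bellman expectation backup v <- R^pi + gamma * P^pi v; the map is
# a gamma-contraction, so the iteration converges to v_pi.
v_pi = np.zeros(n_states)
for _ in range(1000):
    v_pi = R_pi + gamma * P_pi @ v_pi

# Recover q_pi(s, a) = R_s^a + gamma * sum_{s'} P^a_{ss'} v_pi(s').
q_pi = R + gamma * np.einsum('ast,t->sa', P, v_pi)

# Consistency check: v_pi(s) = sum_a pi(a|s) q_pi(s, a).
assert np.allclose(v_pi, (pi * q_pi).sum(axis=1))
```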
We formalize what it means to find an optimal policy.
Optimal state-value function: \begin{align} v_*(s) &= \max_\pi v_\pi(s). \end{align}
Optimal action-value function: \begin{align} q_*(s, a) = \max_\pi q_\pi(s, a). \end{align}
Optimal policy theorem:
Define a partial order \(\geq\) on policies so that \(\pi \geq \pi'\) if \begin{align} v_\pi(s) \geq v_{\pi'}(s), \forall s \in \mathcal S. \end{align} For any MDP, there exists an optimal policy \(\pi_\ast\) that is better than or equal to all other policies, \(\pi_\ast \geq \pi\) for all \(\pi\); moreover, every optimal policy achieves the optimal value function, \(v_{\pi_\ast}(s) = v_\ast(s)\), and the optimal action-value function, \(q_{\pi_\ast}(s, a) = q_\ast(s, a)\).
One example of an optimal policy is the deterministic greedy policy \begin{align} \pi_\ast(a \given s) &= \delta\left(a - \mathrm{argmax}_{a' \in \mathcal A} q_\ast(s, a')\right), \end{align} which puts all probability mass on an action maximizing \(q_\ast(s, \cdot)\).
This gives rise to Bellman Optimality Equations:
\begin{align}
    v_\ast(s) &= \max_{a \in \mathcal A} q_\ast(s, a) \\
    v_\ast(s) &= \max_{a \in \mathcal A} \left(\mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a v_\ast(s')\right) \\
    q_\ast(s, a) &= \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a v_\ast(s') \\
    q_\ast(s, a) &= \mathcal R_s^a + \gamma \sum_{s' \in \mathcal S} \mathcal P_{ss'}^a \max_{a' \in \mathcal A} q_\ast(s', a').
\end{align}
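Unlike the expectation equations, these are nonlinear because of the max, so there is no direct linear solve; a standard approach is to iterate the optimality backup (value iteration) and then read off a greedy policy. A sketch, again reusing `P`, `R`, `gamma`, `n_states` from the MDP snippets above:

```python
# Value iteration: repeatedly apply the Bellman optimality backup.
v_star = np.zeros(n_states)
for _ in range(1000):
    q_star = R + gamma * np.einsum('ast,t->sa', P, v_star)  # q(s, a)
    v_star = q_star.max(axis=1)                             # v(s) = max_a q(s, a)

# A deterministic optimal policy acts greedily with respect to q_*.
greedy_actions = q_star.argmax(axis=1)
print(v_star, greedy_actions)
```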