Reinforcement Learning, Pt. 1

A Markov Decision Process (MDP) is a $5$-tuple

$$(S,A,P,\gamma,R),$$
where

  1. $S$ is a set of states,
  2. $A$ is a set of actions,
  3. $P:S\times A\times S\to[0,1]$ is the transition function, where $P(s,a,s')$ is the probability of going from state $s$ to state $s'$ after taking action $a$,
  4. $\gamma\in[0,1)$ is the discount factor, and
  5. $R$ is the reward function, written either as $R:S\to\mathbb R$ (the reward for being in state $s$) or as $R:S\times A\to\mathbb R$ (the reward for taking action $a$ in state $s$); the formulas below use the state-only form.

A policy is a map $\pi:S\to A$.
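
A concrete illustration may help (a minimal sketch, assuming Python with NumPy and an array layout of my own choosing, not anything prescribed by the definition): a finite MDP can be stored as a transition tensor indexed by $(s,a,s')$, a reward vector indexed by $s$, and a deterministic policy as an array of actions.

```python
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9  # discount factor

# P[s, a, s'] = probability of going from state s to state s' after taking
# action a. Each row P[s, a, :] must sum to 1. Numbers are illustrative only.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.8, 0.2, 0.0]
P[0, 1] = [0.1, 0.0, 0.9]
P[1, 0] = [0.0, 1.0, 0.0]
P[1, 1] = [0.5, 0.0, 0.5]
P[2, 0] = [0.0, 0.0, 1.0]
P[2, 1] = [0.0, 0.0, 1.0]

# R[s] = reward for being in state s (the state-only form used in the formulas).
R = np.array([0.0, 1.0, 10.0])

# A deterministic policy: pi[s] is the action taken in state s.
pi = np.array([1, 0, 0])

assert np.allclose(P.sum(axis=2), 1.0)  # each P[s, a, :] is a distribution
```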

The value associated to a policy $\pi$ is the map

$$\begin{align}V^\pi:S&\to\mathbb R\\s&\mapsto\mathbb E[R(s_0)+\gamma R(s_1)+\gamma^2R(s_2)+\cdots|s_0=s,\pi]\\&=R(s)+\gamma\sum_{s'\in S}P(s,\pi(s),s')V^\pi(s').\end{align}$$
The last expression above is called a Bellman equation.

A policy and its associated value are dual in the sense that each determines the other: the Bellman equation determines $V^\pi$ from $\pi$, and acting greedily with respect to a value function determines a policy.
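
Since the Bellman equation is linear in $V^\pi$, it can be solved exactly for a finite MDP. The sketch below (same assumed array layout as above; the name `evaluate_policy` is hypothetical) builds $P^\pi$, the transition matrix induced by the policy, and solves $(I-\gamma P^\pi)V^\pi=R$.

```python
import numpy as np

def evaluate_policy(P, R, gamma, pi):
    """Solve the Bellman equation V = R + gamma * P_pi @ V exactly."""
    n_states = P.shape[0]
    # P_pi[s, s'] = P(s, pi(s), s'): the transition matrix induced by the policy.
    P_pi = P[np.arange(n_states), pi]
    # Rearranged Bellman equation: (I - gamma * P_pi) V = R.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
```

With the toy arrays from the earlier sketch, `evaluate_policy(P, R, gamma, pi)` returns the vector $(V^\pi(s))_{s\in S}$.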

The optimal value is the map

$$\begin{align}V^*:S&\to\mathbb R\\s&\mapsto\max_\pi V^\pi(s)\\&=R(s)+\gamma\max_{a\in A}\sum_{s'\in S}P(s,a,s')V^*(s').\end{align}$$

The optimal policy is the map

$$\begin{align}\pi^*:S&\to A\\s&\mapsto\arg\max_{a\in A}\sum_{s'\in S}P(s,a,s')V^*(s').\end{align}$$
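
Given a value function stored as a vector, extracting the greedy policy is one contraction over $s'$ followed by an argmax; a sketch under the same assumed layout (the name `greedy_policy` is hypothetical):

```python
import numpy as np

def greedy_policy(P, V):
    """pi(s) = argmax_a sum_{s'} P(s, a, s') * V(s')."""
    # P has shape (S, A, S'); contracting with V over s' gives an (S, A) array
    # of expected next-state values, and the argmax is taken over actions.
    return np.argmax(P @ V, axis=1)
```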

The following is the value iteration algorithm:

  1. Set $V(s)=0$ for all $s\in S$.
  2. For all $s\in S$, set $$V(s)=R(s)+\gamma\max_{a\in A}\sum_{s'\in S}P(s,a,s')V(s').$$
  3. Repeat step 2 until convergence.

Convergence follows because the Bellman backup in step 2 is a contraction mapping with modulus $\gamma$ in the sup norm, so repeated application converges to its unique fixed point $V^*$.
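
A direct translation of the algorithm into the assumed array layout (a sketch; the tolerance and iteration cap are arbitrary choices of mine, not part of the algorithm):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8, max_iters=10_000):
    """Iterate V <- R + gamma * max_a sum_{s'} P(s, a, s') V(s')."""
    V = np.zeros(P.shape[0])  # step 1: V(s) = 0 for all s
    for _ in range(max_iters):
        V_new = R + gamma * np.max(P @ V, axis=1)  # step 2: Bellman backup
        if np.max(np.abs(V_new - V)) < tol:        # step 3: stop when V stabilizes
            return V_new
        V = V_new
    return V
```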

The following is the policy iteration algorithm:

  1. Initialize $\pi$ randomly.
  2. Set $V=V^\pi$.
  3. For all $s\in S$, set $$\pi(s)=\arg\max_{a\in A}\sum_{s'\in S}P(s,a,s')V(s').$$
  4. Repeat steps 2 and 3 until convergence.
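
A sketch of policy iteration under the same assumed layout (the name `policy_iteration` is hypothetical); step 2 reuses the exact linear-system evaluation from above, and convergence is taken to mean that the greedy step no longer changes the policy:

```python
import numpy as np

def policy_iteration(P, R, gamma, seed=0):
    n_states, n_actions = P.shape[0], P.shape[1]
    rng = np.random.default_rng(seed)
    pi = rng.integers(n_actions, size=n_states)  # step 1: random initial policy
    while True:
        # Step 2: policy evaluation, solving (I - gamma * P_pi) V = R exactly.
        P_pi = P[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Step 3: greedy improvement with respect to V.
        pi_new = np.argmax(P @ V, axis=1)
        if np.array_equal(pi_new, pi):  # step 4: policy is stable, so stop
            return pi, V
        pi = pi_new
```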