Reinforcement Learning, Pt. 1

A Markov Decision Process (MDP) is a 5-tuple

$$(S, A, P, \gamma, R),$$
where

  1. $S$ is a set of states,
  2. $A$ is a set of actions,
  3. $P : S \times A \times S \to [0, 1]$ is a probability distribution, where $P(s, a, s')$ is the probability of going from state $s$ to state $s'$ after taking action $a$,
  4. $\gamma \in [0, 1)$ is the discount factor, and
  5. $R : S \to \mathbb{R}$ and $R : S \times A \to \mathbb{R}$ represent the reward for being in state $s$ and the reward for taking action $a$ in state $s$, respectively.
A policy is a map $\pi : S \to A$.
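
For concreteness, the following is a minimal sketch of how a small finite MDP and a policy might be represented in Python. The two states, two actions, transition probabilities, and rewards are invented purely for illustration.

```python
# A toy finite MDP as plain Python dictionaries (all values are made up).

S = ["s0", "s1"]      # set of states
A = ["stay", "go"]    # set of actions
gamma = 0.9           # discount factor in [0, 1)

# P[s][a][s2] = probability of landing in state s2 after taking action a in state s.
P = {
    "s0": {"stay": {"s0": 1.0, "s1": 0.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": 0.7, "s1": 0.3}},
}

# R[s] = reward for being in state s (the state-only form R : S -> R).
R = {"s0": 0.0, "s1": 1.0}

# A policy maps each state to an action.
pi = {"s0": "go", "s1": "stay"}
```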

A value associated to a policy $\pi$ is a map

$$V^{\pi} : S \to \mathbb{R}, \qquad s \mapsto \mathbb{E}\!\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \mid s_0 = s, \pi \right] = R(s) + \gamma \sum_{s' \in S} P(s, \pi(s), s') \, V^{\pi}(s').$$
The last expression above is called a Bellman equation.

A policy and its associated value are dual in the sense that each can be recovered from the other: given $\pi$, the value $V^{\pi}$ is obtained by solving the Bellman equation, and given a value function, a policy is obtained by acting greedily with respect to it.
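
In one direction, the value of a fixed policy can be computed by treating the Bellman equation as an update rule and applying it until it stabilizes. The sketch below assumes the dictionary-based toy MDP above; the function name and tolerance are arbitrary choices.

```python
def evaluate_policy(S, P, R, gamma, pi, tol=1e-8):
    """Approximate V^pi by repeatedly applying the Bellman equation
    V(s) <- R(s) + gamma * sum_{s2} P(s, pi(s), s2) * V(s2)."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            new_v = R[s] + gamma * sum(P[s][pi[s]][s2] * V[s2] for s2 in S)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:    # stop once no state changes by more than tol
            return V
```

For a finite MDP one could instead solve the Bellman equation exactly as a linear system in the $|S|$ unknowns $V^{\pi}(s)$.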

The optimal value is the map

$$V^{*} : S \to \mathbb{R}, \qquad s \mapsto \max_{\pi} V^{\pi}(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P(s, a, s') \, V^{*}(s').$$

The optimal policy is the map

$$\pi^{*} : S \to A, \qquad s \mapsto \operatorname*{arg\,max}_{a \in A} \sum_{s' \in S} P(s, a, s') \, V^{*}(s').$$
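
In the other direction, the argmax in the definition of $\pi^{*}$ translates directly into code: given a value function, acting greedily means choosing, in each state, the action that maximizes the expected value of the successor state. A small sketch, again assuming the dictionary layout used above:

```python
def greedy_policy(S, A, P, V):
    """Extract the greedy policy with respect to a value function V,
    i.e. pi(s) = argmax_a sum_{s2} P(s, a, s2) * V(s2)."""
    return {
        s: max(A, key=lambda a: sum(P[s][a][s2] * V[s2] for s2 in S))
        for s in S
    }
```

Applied to the optimal value $V^{*}$, this recovers an optimal policy $\pi^{*}$.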

The following is the value iteration algorithm:

  1. Set $V(s) = 0$ for all $s \in S$.
  2. For all $s \in S$, set $V(s) := R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P(s, a, s') \, V(s')$.
  3. Repeat step 2 until convergence.
Convergence follows from the fact that the Bellman update in step 2 is a contraction mapping (in the sup norm, with contraction factor $\gamma$), so repeated application converges to its unique fixed point $V^{*}$.
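
The three steps above translate almost line for line into code. This sketch again assumes the dictionary-based toy MDP; stopping when the largest per-state change drops below a tolerance is one common way to make "until convergence" concrete.

```python
def value_iteration(S, A, P, R, gamma, tol=1e-8):
    """Approximate V* by repeatedly applying the Bellman optimality update
    V(s) <- R(s) + gamma * max_a sum_{s2} P(s, a, s2) * V(s2)."""
    V = {s: 0.0 for s in S}                      # step 1: V(s) = 0 for all s
    while True:
        delta = 0.0
        for s in S:                              # step 2: Bellman optimality update
            new_v = R[s] + gamma * max(
                sum(P[s][a][s2] * V[s2] for s2 in S) for a in A
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:                          # step 3: repeat until convergence
            return V
```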

The following is the policy iteration algorithm:

  1. Initialize $\pi$ randomly.
  2. Set $V = V^{\pi}$.
  3. For all $s \in S$, set $\pi(s) = \operatorname*{arg\,max}_{a \in A} \sum_{s' \in S} P(s, a, s') \, V(s')$.
  4. Repeat steps 2 and 3 until convergence, i.e. until the policy no longer changes.
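
Putting the pieces together, here is a sketch of policy iteration that reuses the evaluate_policy and greedy_policy helpers from the sketches above (both are illustrative assumptions, not a fixed API). In a finite MDP the policy stops changing after finitely many improvement steps.

```python
import random

def policy_iteration(S, A, P, R, gamma):
    """Alternate policy evaluation (step 2) and greedy improvement (step 3)
    until the policy stops changing."""
    pi = {s: random.choice(A) for s in S}        # step 1: random initial policy
    while True:
        V = evaluate_policy(S, P, R, gamma, pi)  # step 2: V = V^pi
        new_pi = greedy_policy(S, A, P, V)       # step 3: greedy improvement
        if new_pi == pi:                         # steps 2-3 repeated until stable
            return pi, V
        pi = new_pi
```

On the toy MDP defined earlier, `policy_iteration(S, A, P, R, gamma)` would return a policy together with its value function.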