Reinforcement Learning, Pt. 1

A Markov Decision Process (MDP) is a $5$-tuple

$$(S,A,P,\gamma,R),$$
where

  1. $S$ is a set of states,
  2. $A$ is a set of actions,
  3. $P:S\times A\times S\to[0,1]$ gives the transition probabilities: $P(s,a,s')$ is the probability of moving from state $s$ to state $s'$ after taking action $a$,
  4. $\gamma\in[0,1)$ is the discount factor, and
  5. $R:S\to\mathbb R$ (or $R:S\times A\to\mathbb R$) is a reward function, giving the reward for being in state $s$ (or for taking action $a$ in state $s$); the state-only form $R(s)$ is used below.

A policy is a map $\pi:S\to A$.
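
As a concrete illustration, here is a minimal sketch (not from the original post) of one way to store a small, entirely made-up MDP with NumPy arrays: $P$ becomes an array indexed as `P[s, a, s']`, $R$ an array indexed by state, and a deterministic policy an array of one action index per state.

```python
import numpy as np

n_states, n_actions = 3, 2
gamma = 0.9  # discount factor

# P[s, a, s'] = probability of moving from s to s' under action a.
# The numbers are invented for illustration; each row P[s, a] sums to 1.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],   # transitions out of state 0
    [[0.0, 0.5, 0.5], [0.0, 0.1, 0.9]],   # transitions out of state 1
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],   # state 2 is absorbing
])
assert P.shape == (n_states, n_actions, n_states)

R = np.array([0.0, 0.0, 1.0])             # R[s]: reward for being in state s
pi = np.array([0, 1, 0])                  # a policy: one action index per state
```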

A value associated to a policy $\pi$ is a map

$$\begin{align}V^\pi:S&\to\mathbb R\\s&\mapsto\mathbb E[R(s_0)+\gamma R(s_1)+\gamma^2R(s_2)+\cdots|s_0=s,\pi]\\&=R(s)+\gamma\sum_{s'\in S}P(s,\pi(s),s')V^\pi(s').\end{align}$$
The last equality above is the Bellman equation for $V^\pi$.

A policy and its associated value are dual in the sense that one can be determined from the other: given $\pi$, the Bellman equation determines $V^\pi$, and given a value function $V$, choosing in each state the action that maximizes $\sum_{s'\in S}P(s,a,s')V(s')$ gives a (greedy) policy.
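
To illustrate one direction of this duality (recovering $V^\pi$ from $\pi$), here is a minimal sketch that solves the Bellman equation directly as the linear system $V=R+\gamma P_\pi V$, where $P_\pi$ is the transition matrix induced by $\pi$. It assumes the hypothetical toy MDP and array layout from the sketch above.

```python
import numpy as np

gamma = 0.9
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.9, 0.0]],
    [[0.0, 0.5, 0.5], [0.0, 0.1, 0.9]],
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
])
R = np.array([0.0, 0.0, 1.0])
pi = np.array([0, 1, 0])

def evaluate_policy(P, R, gamma, pi):
    """Solve (I - gamma * P_pi) V = R, where P_pi[s, s'] = P[s, pi[s], s']."""
    n = len(R)
    P_pi = P[np.arange(n), pi]                   # row s is P[s, pi[s], :]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)

print(evaluate_policy(P, R, gamma, pi))          # V^pi for the toy MDP
```

The linear solve works because $\gamma<1$ makes $I-\gamma P_\pi$ invertible; iterating the Bellman update until convergence would give the same answer.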

The optimal value is the map

$$\begin{align}V^*:S&\to\mathbb R\\s&\mapsto\max_\pi V^\pi(s)\\&=R(s)+\gamma\max_{a\in A}\sum_{s'\in S}P(s,a,s')V^*(s').\end{align}$$

The optimal policy is the map

$$\begin{align}\pi^*:S&\to A\\s&\mapsto\arg\max_{a\in A}\sum_{s'\in S}P(s,a,s')V^*(s').\end{align}$$
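This $\arg\max$ can be computed directly once a value function is in hand. A minimal sketch, assuming the same hypothetical array layout ($P$ of shape $(|S|,|A|,|S|)$, $V$ of shape $(|S|,)$):

```python
import numpy as np

def greedy_policy(P, V):
    """pi[s] = argmax_a sum_{s'} P[s, a, s'] * V[s']."""
    return np.argmax(P @ V, axis=1)   # (P @ V)[s, a] = sum_{s'} P[s, a, s'] * V[s']
```

In practice the $V$ passed in would be (an approximation of) $V^*$, computed for example by value iteration, described next.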

The following is the value iteration algorithm:

  1. Set $V(s)=0$ for all $s\in S$.
  2. For all $s\in S$, set $$V(s)=R(s)+\gamma\max_{a\in A}\sum_{s'\in S}P(s,a,s')V(s').$$
  3. Repeat step 2 until convergence.

Convergence follows from the fact that the update in step 2 is a contraction mapping (with modulus $\gamma$) in the sup norm, so repeated application converges to its unique fixed point $V^*$.
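
A minimal sketch of value iteration, assuming the hypothetical NumPy layout used above ($P$ of shape $(|S|,|A|,|S|)$, $R$ of shape $(|S|,)$); a tolerance-based stopping rule stands in for "until convergence".

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    V = np.zeros(len(R))                        # step 1: V(s) = 0 for all s
    while True:
        # step 2: Bellman optimality backup for every state at once
        V_new = R + gamma * np.max(P @ V, axis=1)
        if np.max(np.abs(V_new - V)) < tol:     # step 3: stop once the sup-norm
            return V_new                        # change drops below tol
        V = V_new
```

With the toy MDP above, `greedy_policy(P, value_iteration(P, R, gamma))` then recovers $\pi^*$.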

The following is the policy iteration algorithm:

  1. Initialize $\pi$ randomly.
  2. Set $V=V^\pi$ (evaluate the current policy).
  3. For all $s\in S$, set $$\pi(s)=\arg\max_{a\in A}\sum_{s'\in S}P(s,a,s')V(s').$$
  4. Repeat steps 2 and 3 until $\pi$ no longer changes.
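
A minimal sketch of policy iteration under the same hypothetical array layout, with step 2 done exactly by a linear solve and step 3 as the greedy improvement:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    n_states, n_actions, _ = P.shape
    pi = np.random.randint(n_actions, size=n_states)   # step 1: random policy
    while True:
        # step 2: evaluate pi by solving V = R + gamma * P_pi V
        P_pi = P[np.arange(n_states), pi]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # step 3: greedy improvement with respect to V
        pi_new = np.argmax(P @ V, axis=1)
        if np.array_equal(pi_new, pi):                  # step 4: stop when pi
            return pi, V                                # no longer changes
        pi = pi_new
```

On the toy MDP above this should return the same policy as extracting the greedy policy from the value-iteration output, since both are optimal.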
