$$(S,A,P,\gamma,R),$$
where
- $S$ is a set of states,
- $A$ is a set of actions,
- $P:S\times A\times S\to[0,1]$ is the transition function, where $P(s,a,s')$ is the probability of moving from state $s$ to state $s'$ after taking action $a$,
- $\gamma\in[0,1)$ is the discount factor, and
- $R:S\to\mathbb R$ or $R:S\times A\to\mathbb R$ is the reward function, giving the reward for being in state $s$ or for taking action $a$ in state $s$, respectively (a toy example is sketched after this list).
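As a concrete illustration, here is a minimal sketch of such a tuple for a two-state, two-action MDP in Python/NumPy; the specific numbers and the array layout are illustrative choices, not part of the definition above.

```python
# A minimal sketch of a 2-state, 2-action MDP as NumPy arrays (illustrative values).
import numpy as np

# P[s, a, s'] = probability of moving from state s to s' after taking action a.
P = np.array([
    [[0.9, 0.1],   # from state 0, action 0
     [0.2, 0.8]],  # from state 0, action 1
    [[0.0, 1.0],   # from state 1, action 0
     [0.5, 0.5]],  # from state 1, action 1
])

R = np.array([0.0, 1.0])  # reward R(s) for being in each state
gamma = 0.9               # discount factor

assert np.allclose(P.sum(axis=2), 1.0)  # each P[s, a, :] is a probability distribution
```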
The value function associated with a policy $\pi$ is the map
$$\begin{align}V^\pi:S&\to\mathbb R\\s&\mapsto\mathbb E\!\left[\left.\sum_{t\ge0}\gamma^t R(s_t)\,\right|\,s_0=s,\,\pi\right]=R(s)+\gamma\sum_{s'\in S}P(s,\pi(s),s')V^\pi(s').\end{align}$$
The last expression above is called a Bellman equation.
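Because the Bellman equation is linear in $V^\pi$ for a fixed policy, $V^\pi$ can be computed by solving a linear system. Below is a hedged sketch using the toy MDP arrays from above; the deterministic policy and variable names are illustrative assumptions.

```python
# Sketch of exact policy evaluation: solve (I - gamma * P_pi) V = R for a fixed policy pi.
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([0.0, 1.0])
gamma = 0.9
pi = np.array([1, 0])  # pi(s) = action taken in state s (illustrative policy)

# P_pi[s, s'] = P(s, pi(s), s'), the transition matrix induced by pi.
P_pi = P[np.arange(len(R)), pi]

# Rearranging V = R + gamma * P_pi @ V gives the linear system below.
V_pi = np.linalg.solve(np.eye(len(R)) - gamma * P_pi, R)
```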
A policy and its associated value function are dual in the sense that each determines the other: $V^\pi$ is obtained from $\pi$ via the Bellman equation, and a policy is obtained from a value function by acting greedily with respect to it.
The optimal value is the map
$$\begin{align}V^*:S&\to\mathbb R\\s&\mapsto\max_\pi V^\pi(s)=R(s)+\gamma\max_{a\in A}\sum_{s'\in S}P(s,a,s')V^*(s').\end{align}$$
The optimal policy is the map
$$\begin{align}\pi^*:S&\to A\\s&\mapsto\arg\max_{a\in A}\sum_{s'\in S}P(s,a,s')V^*(s').\end{align}$$
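Once $V^*$ is known, this greedy extraction is a single $\arg\max$ per state. A minimal sketch, assuming the toy MDP arrays from above and a placeholder $V^*$:

```python
# Sketch of extracting the greedy policy pi*(s) = argmax_a sum_{s'} P(s,a,s') V*(s').
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
V_star = np.array([8.5, 9.6])  # placeholder values; in practice computed by value iteration

# P @ V_star has shape (states, actions): the expected next-state value of each action.
pi_star = np.argmax(P @ V_star, axis=1)
```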
The following is the value iteration algorithm:
1. Set $V(s)=0$ for all $s\in S$.
2. For all $s\in S$, set $$V(s)=R(s)+\gamma\max_{a\in A}\sum_{s'\in S}P(s,a,s')V(s').$$
3. Repeat step 2 until convergence.
Convergence follows from the fact that the Bellman update in step 2 is a $\gamma$-contraction in the sup norm, so by the Banach fixed-point theorem repeated application converges to its unique fixed point $V^*$.
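A minimal value iteration sketch following steps 1 and 2 above, run on the toy MDP; the stopping tolerance and iteration cap are illustrative assumptions.

```python
# Sketch of value iteration: iterate the Bellman optimality update until it stabilizes.
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([0.0, 1.0])
gamma = 0.9

V = np.zeros(len(R))                      # step 1: V(s) = 0 for all s
for _ in range(10_000):
    # step 2: V(s) <- R(s) + gamma * max_a sum_{s'} P(s,a,s') V(s')
    V_new = R + gamma * np.max(P @ V, axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:  # step 3: stop at (approximate) convergence
        V = V_new
        break
    V = V_new
```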
The following is the policy iteration algorithm:
1. Initialize $\pi$ randomly.
2. Set $V=V^\pi$ (policy evaluation).
3. For all $s\in S$, set $$\pi(s)=\arg\max_{a\in A}\sum_{s'\in S}P(s,a,s')V(s').$$
4. Repeat steps 2 and 3 until $\pi$ no longer changes (a code sketch follows the steps).
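A hedged sketch of policy iteration on the same toy MDP, alternating exact policy evaluation (step 2) with greedy improvement (step 3) until the policy is stable; the random seed and array layout are illustrative assumptions.

```python
# Sketch of policy iteration: evaluate the current policy exactly, then improve it greedily.
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.0, 1.0], [0.5, 0.5]]])
R = np.array([0.0, 1.0])
gamma = 0.9
n_states = len(R)

rng = np.random.default_rng(0)
pi = rng.integers(0, P.shape[1], size=n_states)  # step 1: random initial policy

while True:
    # step 2: policy evaluation, V = V^pi from (I - gamma * P_pi) V = R
    P_pi = P[np.arange(n_states), pi]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
    # step 3: greedy improvement with respect to V
    pi_new = np.argmax(P @ V, axis=1)
    if np.array_equal(pi_new, pi):  # step 4: stop once the policy no longer changes
        break
    pi = pi_new
```

Since there are finitely many deterministic policies and each improvement step never decreases the value, this loop terminates on a finite MDP.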