A Markov decision process (MDP) is a tuple $(S, A, \{P_{sa}\}, \gamma, R)$, where

- $S$ is a set of states,
- $A$ is a set of actions,
- $P_{sa}$ is a probability distribution over $S$ representing the probability of going from state $s$ to state $s'$ after taking action $a$,
- $\gamma \in [0, 1)$ is the discount factor, and
- $R(s)$ and $R(s, a)$ represent the reward for being in state $s$ and for being in state $s$ after taking action $a$, respectively.
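As a concrete illustration, a small finite MDP can be stored as plain NumPy arrays. This is only a sketch with made-up numbers: a two-state, two-action MDP with a state-only reward $R(s)$.

```python
import numpy as np

# Toy MDP (illustrative numbers).
# P[s, a, s'] is the probability of moving to state s' after taking action a in state s.
P = np.array([[[0.9, 0.1],      # state 0, action 0
               [0.2, 0.8]],     # state 0, action 1
              [[0.7, 0.3],      # state 1, action 0
               [0.05, 0.95]]])  # state 1, action 1
R = np.array([1.0, -1.0])       # R(s): reward for being in state s
gamma = 0.9                     # discount factor, gamma in [0, 1)

# Sanity check: each P[s, a] must be a probability distribution over next states.
assert np.allclose(P.sum(axis=2), 1.0)
```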
A value associated to a policy $\pi : S \to A$ is a map $V^\pi : S \to \mathbb{R}$ given by

$$V^\pi(s) = \mathbb{E}\!\left[\,\sum_{t \ge 0} \gamma^t R(s_t) \;\middle|\; s_0 = s,\, \pi \right] = R(s) + \gamma \sum_{s' \in S} P_{s\pi(s)}(s')\, V^\pi(s').$$
The last expression above is called a Bellman equation.
A policy and its associated value are dual in the sense that one can be determined from the other: $V^\pi$ follows from $\pi$ by solving the Bellman equations, and a policy follows from a value function $V$ by acting greedily, $\pi(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V(s')$.
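Because the Bellman equation is linear in $V^\pi$, the value of a fixed policy on a finite MDP can be computed exactly by solving the linear system $(I - \gamma P^\pi) V^\pi = R$. A minimal sketch on the toy MDP above (the policy `pi` is an arbitrary illustrative choice):

```python
import numpy as np

# Same toy MDP as above (illustrative numbers).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])  # P[s, a, s']
R = np.array([1.0, -1.0])                   # R(s)
gamma = 0.9

pi = np.array([0, 1])  # a deterministic policy: pi[s] is the action taken in state s

# P_pi[s, s'] = P_{s, pi(s)}(s'): the transition matrix induced by following pi.
P_pi = P[np.arange(len(R)), pi]
# The Bellman equation V = R + gamma * P_pi @ V is linear in V,
# so solve (I - gamma * P_pi) V = R directly.
V_pi = np.linalg.solve(np.eye(len(R)) - gamma * P_pi, R)
print(V_pi)
```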
The optimal value is the map $V^* : S \to \mathbb{R}$ given by

$$V^*(s) = \max_{\pi} V^\pi(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s')\, V^*(s').$$
The optimal policy is the map $\pi^* : S \to A$ given by

$$\pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V^*(s').$$
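In code, reading a greedy policy off a value function is a single argmax per state. A sketch, assuming the toy transition tensor `P` from above and some value estimate `V` (both illustrative):

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])  # P[s, a, s'], toy numbers
V = np.array([5.0, -2.0])                   # some value estimate V(s')

# Q[s, a] = sum_{s'} P_{sa}(s') V(s'); the greedy policy takes the argmax over a.
Q = P @ V
pi = np.argmax(Q, axis=1)
print(pi)
```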
The following is the value iteration algorithm:
1. Set $V(s) := 0$ for all $s \in S$.
2. For all $s \in S$, set
   $$V(s) := R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s')\, V(s').$$
3. Repeat step 2 until convergence.
Convergence is due to the fact that the Bellman backup operator $B$, defined by $(BV)(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s')\, V(s')$, is a contraction mapping with modulus $\gamma$ in the sup norm, so the iterates converge to its unique fixed point $V^*$.
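A minimal NumPy sketch of the three steps above, run on the toy MDP from earlier (the stopping tolerance is an illustrative choice):

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])  # P[s, a, s'], toy numbers
R = np.array([1.0, -1.0])                   # R(s)
gamma = 0.9

V = np.zeros(len(R))  # step 1: V(s) := 0 for all s
while True:
    # step 2: V(s) := R(s) + max_a gamma * sum_{s'} P_{sa}(s') V(s')
    V_new = R + gamma * np.max(P @ V, axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:  # step 3: stop once the sup-norm change is tiny
        V = V_new
        break
    V = V_new
print(V)  # approximates V* by the contraction argument
```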
The following is the policy iteration algorithm:
1. Initialize $\pi$ randomly.
2. Set $V := V^\pi$, i.e., solve the Bellman equations for the current policy.
3. For all $s \in S$, set
   $$\pi(s) := \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V(s').$$
4. Repeat steps 2 and 3 until convergence.
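A matching sketch of policy iteration on the same toy MDP, with the evaluation step done by the linear solve shown earlier (all names and numbers are illustrative):

```python
import numpy as np

P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.7, 0.3], [0.05, 0.95]]])  # P[s, a, s'], toy numbers
R = np.array([1.0, -1.0])                   # R(s)
gamma = 0.9
n_states, n_actions = len(R), P.shape[1]

pi = np.random.default_rng(0).integers(n_actions, size=n_states)  # step 1: random policy
while True:
    # step 2: policy evaluation -- solve (I - gamma * P_pi) V = R for V = V^pi
    P_pi = P[np.arange(n_states), pi]
    V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
    # step 3: policy improvement -- act greedily with respect to V
    pi_new = np.argmax(P @ V, axis=1)
    if np.array_equal(pi_new, pi):  # step 4: stop once the policy is stable
        break
    pi = pi_new
print(pi, V)
```

Since a finite MDP has only finitely many deterministic policies and each improvement step is monotone, this loop terminates exactly, rather than only in the limit.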