Reinforcement Learning, Pt. 1

A Markov Decision Process (MDP) is a 5-tuple

$$(S, A, P, \gamma, R),$$
where

  1. $S$ is a set of states,
  2. $A$ is a set of actions,
  3. $P : S \times A \times S \to [0, 1]$ is a probability distribution, where $P(s, a, s')$ is the probability of going from state $s$ to state $s'$ after taking action $a$,
  4. $\gamma \in [0, 1)$ is the discount factor, and
  5. $R : S \to \mathbb{R}$ and $R : S \times A \to \mathbb{R}$ represent the reward for being in state $s$ and the reward for taking action $a$ in state $s$, respectively.
A policy is a map $\pi : S \to A$.
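
For concreteness, the following is a minimal sketch of how a small finite MDP and a policy might be represented in Python. The two states, two actions, transition probabilities, and rewards are invented purely for illustration.

```python
# A toy finite MDP as plain Python dictionaries (all values are made up).

S = ["s0", "s1"]      # set of states
A = ["stay", "go"]    # set of actions
gamma = 0.9           # discount factor in [0, 1)

# P[s][a][s2] = probability of landing in state s2 after taking action a in state s.
P = {
    "s0": {"stay": {"s0": 1.0, "s1": 0.0}, "go": {"s0": 0.2, "s1": 0.8}},
    "s1": {"stay": {"s0": 0.0, "s1": 1.0}, "go": {"s0": 0.7, "s1": 0.3}},
}

# R[s] = reward for being in state s (the state-only form R : S -> R).
R = {"s0": 0.0, "s1": 1.0}

# A policy maps each state to an action.
pi = {"s0": "go", "s1": "stay"}
```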

A value associated to a policy $\pi$ is a map

$$V^{\pi} : S \to \mathbb{R}, \qquad s \mapsto \mathbb{E}\!\left[ R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \mid s_0 = s, \pi \right] = R(s) + \gamma \sum_{s' \in S} P(s, \pi(s), s') \, V^{\pi}(s').$$
The last expression above is called a Bellman equation.

A policy and its associated value are dual in the sense that each can be recovered from the other: given $\pi$, the value $V^{\pi}$ is obtained by solving the Bellman equation, and given a value function, a policy is obtained by acting greedily with respect to it.
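
In one direction, the value of a fixed policy can be computed by treating the Bellman equation as an update rule and applying it until it stabilizes. The sketch below assumes the dictionary-based toy MDP above; the function name and tolerance are arbitrary choices.

```python
def evaluate_policy(S, P, R, gamma, pi, tol=1e-8):
    """Approximate V^pi by repeatedly applying the Bellman equation
    V(s) <- R(s) + gamma * sum_{s2} P(s, pi(s), s2) * V(s2)."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            new_v = R[s] + gamma * sum(P[s][pi[s]][s2] * V[s2] for s2 in S)
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:    # stop once no state changes by more than tol
            return V
```

For a finite MDP one could instead solve the Bellman equation exactly as a linear system in the $|S|$ unknowns $V^{\pi}(s)$.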

The optimal value is the map

$$V^{*} : S \to \mathbb{R}, \qquad s \mapsto \max_{\pi} V^{\pi}(s) = R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P(s, a, s') \, V^{*}(s').$$

The optimal policy is the map

$$\pi^{*} : S \to A, \qquad s \mapsto \operatorname*{arg\,max}_{a \in A} \sum_{s' \in S} P(s, a, s') \, V^{*}(s').$$
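
In the other direction, the argmax in the definition of $\pi^{*}$ translates directly into code: given a value function, acting greedily means choosing, in each state, the action that maximizes the expected value of the successor state. A small sketch, again assuming the dictionary layout used above:

```python
def greedy_policy(S, A, P, V):
    """Extract the greedy policy with respect to a value function V,
    i.e. pi(s) = argmax_a sum_{s2} P(s, a, s2) * V(s2)."""
    return {
        s: max(A, key=lambda a: sum(P[s][a][s2] * V[s2] for s2 in S))
        for s in S
    }
```

Applied to the optimal value $V^{*}$, this recovers an optimal policy $\pi^{*}$.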

The following is the value iteration algorithm:

  1. Set $V(s) = 0$ for all $s \in S$.
  2. For all $s \in S$, set $V(s) := R(s) + \gamma \max_{a \in A} \sum_{s' \in S} P(s, a, s') \, V(s')$.
  3. Repeat step 2 until convergence.
Convergence follows from the fact that the Bellman update in step 2 is a contraction mapping (in the sup norm, with contraction factor $\gamma$), so repeated application converges to its unique fixed point $V^{*}$.
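
The three steps above translate almost line for line into code. This sketch again assumes the dictionary-based toy MDP; stopping when the largest per-state change drops below a tolerance is one common way to make "until convergence" concrete.

```python
def value_iteration(S, A, P, R, gamma, tol=1e-8):
    """Approximate V* by repeatedly applying the Bellman optimality update
    V(s) <- R(s) + gamma * max_a sum_{s2} P(s, a, s2) * V(s2)."""
    V = {s: 0.0 for s in S}                      # step 1: V(s) = 0 for all s
    while True:
        delta = 0.0
        for s in S:                              # step 2: Bellman optimality update
            new_v = R[s] + gamma * max(
                sum(P[s][a][s2] * V[s2] for s2 in S) for a in A
            )
            delta = max(delta, abs(new_v - V[s]))
            V[s] = new_v
        if delta < tol:                          # step 3: repeat until convergence
            return V
```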

The following is the policy iteration algorithm:

  1. Initialize $\pi$ randomly.
  2. Set $V = V^{\pi}$.
  3. For all $s \in S$, set $\pi(s) = \operatorname*{arg\,max}_{a \in A} \sum_{s' \in S} P(s, a, s') \, V(s')$.
  4. Repeat steps 2 and 3 until convergence, i.e. until the policy no longer changes.
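
Putting the pieces together, here is a sketch of policy iteration that reuses the evaluate_policy and greedy_policy helpers from the sketches above (both are illustrative assumptions, not a fixed API). In a finite MDP the policy stops changing after finitely many improvement steps.

```python
import random

def policy_iteration(S, A, P, R, gamma):
    """Alternate policy evaluation (step 2) and greedy improvement (step 3)
    until the policy stops changing."""
    pi = {s: random.choice(A) for s in S}        # step 1: random initial policy
    while True:
        V = evaluate_policy(S, P, R, gamma, pi)  # step 2: V = V^pi
        new_pi = greedy_policy(S, A, P, V)       # step 3: greedy improvement
        if new_pi == pi:                         # steps 2-3 repeated until stable
            return pi, V
        pi = new_pi
```

On the toy MDP defined earlier, `policy_iteration(S, A, P, R, gamma)` would return a policy together with its value function.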