Lecture 2: Markov Decision Processes | Reinforcement Learning | David Silver

1. Markov Process / Markov chain

1.1. Markov process

A Markov process or Markov chain is a tuple $\langle S,P \rangle$ such that

  • $S$ is a finite set of states, and
  • $P$ is a transition probability matrix.

In a Markov process, the initial state is assumed to be given; how the initial state is chosen is not part of the Markov process itself.

1.2. State transition probability

A state transition probability from $s$ to $s'$ is

\[
P_{ss'}=\mathbb{P}[S_{t+1}=s' | S_{t}=s]
\]

Episode

An episode of a Markov process is a trajectory of states sampled according to the transition probabilities.
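
Sampling an episode only requires repeatedly drawing the next state from the row of $P$ for the current state. A minimal sketch in Python (the states, transition matrix, and terminal state below are made-up examples, not taken from the lecture):

```python
import numpy as np

# Hypothetical 3-state Markov chain; "Sleep" acts as a terminal (absorbing) state.
states = ["Class", "Pub", "Sleep"]
P = np.array([[0.5, 0.2, 0.3],   # P[i, j] = P(S_{t+1} = j | S_t = i)
              [0.4, 0.4, 0.2],
              [0.0, 0.0, 1.0]])

def sample_episode(start=0, max_steps=20, rng=np.random.default_rng(0)):
    """Sample a trajectory of states (an episode) from the chain."""
    s, episode = start, [start]
    for _ in range(max_steps):
        s = rng.choice(len(states), p=P[s])   # next state ~ the row P[s]
        episode.append(s)
        if states[s] == "Sleep":              # stop once the absorbing state is reached
            break
    return [states[i] for i in episode]

print(sample_episode())   # e.g. ['Class', 'Pub', ..., 'Sleep']
```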


2. Markov reward process

2.1. Markov reward process (MRP)

A Markov reward process is a tuple $\langle S,P, R, \gamma \rangle$ such that

  • $S$ is a finite set of states,
  • $P$ is a transition probability matrix,
  • $R$ is a reward function, and
    • A reward function on state $s$ is $R_{s}=\mathbb{E}[r_{t+1}|S_{t}=s]$.
    • $r_{t}$ is a reward at time $t$.
  • $\gamma$ is a discount factor, $\gamma \in [0,1]$.
The steps of an MRP [sungjae]
  1. Be in a state.
  2. Move to a new state.
  3. Gain a reward on arriving at the new state.
  4. Go to step 1.

2.2. Return

The return at time-step $t$, denoted $G_{t}$, is the total discounted reward from time-step $t$.

\[
G_{t}=r_{t+1}+\gamma r_{t+2}+\gamma^{2} r_{t+3}+\cdots = \sum_{k=0}^{\infty}{\gamma^{k}r_{t+1+k}}
\]
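
As a quick numerical check of the formula (the reward sequence and $\gamma$ below are arbitrary examples), the return can be accumulated from the end of the sequence:

```python
def discounted_return(rewards, gamma):
    """G_t = r_{t+1} + gamma * r_{t+2} + ... for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):   # accumulate backwards: G = r + gamma * G
        g = r + gamma * g
    return g

# Example: r_{t+1}, ..., r_{t+4} = -2, -2, 10, 0 and gamma = 0.9
print(discounted_return([-2, -2, 10, 0], 0.9))   # -2 + 0.9*(-2) + 0.81*10 + 0 = 4.3
```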

2.3. Value function in MRP
State value function

The state value function $v(s)$ of a Markov reward process is the expected return starting from state $s$.

\[
v(s)=\mathbb{E}[G_{t}|S_{t}=s]
\]
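
Because $v(s)$ is an expectation, it can be estimated by averaging sampled returns over many episodes that start in $s$. A sketch with a made-up 2-state MRP (state 1 is absorbing with zero reward; the numbers are not from the lecture):

```python
import numpy as np

# Hypothetical 2-state MRP.
P = np.array([[0.6, 0.4],    # transition matrix
              [0.0, 1.0]])
R = np.array([1.0, 0.0])     # R_s = E[r_{t+1} | S_t = s]
gamma = 0.9
rng = np.random.default_rng(0)

def sampled_return(s, max_steps=200):
    """One sampled return G_t starting from state s."""
    g, discount = 0.0, 1.0
    for _ in range(max_steps):
        g += discount * R[s]          # reward received for being in state s
        discount *= gamma
        s = rng.choice(2, p=P[s])
        if s == 1:                    # absorbing state: all later rewards are zero
            break
    return g

# Monte-Carlo estimate of v(0); the exact value is 1 / (1 - 0.9 * 0.6) ≈ 2.17.
print(np.mean([sampled_return(0) for _ in range(5000)]))
```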

2.4. Bellman equation for MRP
Bellman equation for Markov reward processes

The Bellman equation for MRPs states that $v(s)$ can be expressed in terms of the values $v(s')$ of its successor states.

\[
v(s)=R_{s}+\gamma\sum_{s'\in S}{P_{ss'}v(s')}
\]

Bellman equation in matrix form

\[

\begin{bmatrix}
v(1) \\ \vdots \\ v(n)
\end{bmatrix}
=
\begin{bmatrix}
R_{1} \\ \vdots \\ R_{n}
\end{bmatrix}

+
\gamma

\begin{bmatrix}
P_{11} & \cdots & P_{1n} \\
\vdots & \ddots & \vdots \\
P_{n1} & \cdots & P_{nn}
\end{bmatrix}

\begin{bmatrix}
v(1) \\ \vdots \\ v(n)
\end{bmatrix}

\]

\[
v=R+\gamma P v
\]

Solving the Bellman equation

\[
v=R+\gamma P v
\]

\[
(I-\gamma P)v=R
\]

\[
v=(I-\gamma P)^{-1}R
\]

  • The computational complexity of solving directly for $v$ is $O(n^3)$ for $n$ states.
  • This analytic method is only practical when $n$ is small (a small worked sketch of the direct solve follows this list).
  • If $n$ is large, the following iterative methods are used to solve MRPs instead:
    1. Dynamic programming
    2. Monte-Carlo evaluation
    3. Temporal-Difference learning
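
A sketch of the direct solve with NumPy, reusing the same made-up 2-state MRP as in the Monte-Carlo sketch above (any small $P$, $R$, $\gamma$ would do):

```python
import numpy as np

# Made-up 2-state MRP (state 1 is absorbing with zero reward).
P = np.array([[0.6, 0.4],
              [0.0, 1.0]])
R = np.array([1.0, 0.0])
gamma = 0.9

# Solve (I - gamma * P) v = R instead of forming the inverse explicitly.
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)   # approximately [2.174, 0.0]
```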

3. Markov decision process

3.1. Markov decision process (MDP)

A Markov decision process (MDP) is a Markov reward process with decisions (actions).

A Markov decision process is a tuple $\langle S,A, P, R, \gamma \rangle$ such that

  • $S$ is a finite set of states,
  • $A$ is a finite set of actions,
  • $P$ is a transition probability matrix,
    • $P_{ss'}^{a}=\mathbb{P}[S_{t+1}=s' | S_{t}=s, A_{t}=a]$
  • $R$ is a reward function, and
    • The reward function for taking action $a$ in state $s$ is $R_{s}^{a}=\mathbb{E}[r_{t+1} | S_{t}=s, A_{t}=a]$.
    • $r_{t}$ is a reward at time $t$.
  • $\gamma$ is a discount factor, $\gamma \in [0,1]$.
The steps of an MDP [sungjae]
  1. Be in a state.
  2. Take an action.
  3. Gain a reward for the action.
  4. Arrive at a new state.
  5. Go to step 1.

3.2. Policy

A policy $\pi$ is a distribution over actions $a$ given states $s$.

\[
\pi (a|s) = \mathbb{P}[A_{t}=a | S_{t}=s]
\]

  • A policy defines the behavior of an agent.
  • MDP policies depend only on the current state.
  • MDP policies are parameterized by states.
  • Policies are stationary (time-independent).
    • $A_{t}\sim\pi(\cdot | S_{t}), \forall{t} > 0$
    • A stationary policy does not change over time, i.e., it is not dynamic.
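
A stochastic policy can be stored as a table $\pi[s, a]$ whose rows sum to one; acting with it just means sampling an action from the row of the current state. A minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical policy over 2 states and 2 actions: pi[s, a] = P(A_t = a | S_t = s).
pi = np.array([[0.7, 0.3],
               [0.1, 0.9]])
rng = np.random.default_rng(0)

def act(s):
    """Sample A_t ~ pi(. | S_t = s); the rule depends only on the current state."""
    return int(rng.choice(pi.shape[1], p=pi[s]))

print([act(0) for _ in range(5)])   # e.g. [0, 1, 0, 0, 0]
```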
3.3. Value functions in Markov decision processes

There are two value functions in MDPs: [1] the state-value function and [2] the action-value function.

State-value function

The state-value function $v_{\pi}(s)$ of an MDP is the expected return [1] starting from state $s$ and [2] following policy $\pi$.

\[
v_{\pi}(s)=\mathbb{E}_{\pi}[G_{t}|S_{t}=s]
\]

Action-value function

The action-value function $q_{\pi}(s,a)$ is the expected return [1] starting from state $s$, [2] taking action $a$, and then [3] following policy $\pi$.

\[
q_{\pi}(s,a)=\mathbb{E}_{\pi}[G_{t} | S_{t} = s , A_{t} = a]
\]

3.4. Bellman expectation equation in MDP
Bellman expectation equation for $v^{\pi}$

\[
v_{\pi}(s)=\sum_{a\in A}{\pi(a|s)q_{\pi}(s,a)}
\]

  • $s$ is the current state.
  • $A$ is the set of actions that can be taken in state $s$.
  • $a\in A$ is an action available in $s$.
  • $\pi$ is a particular policy.
  • Conclusion: $v_{\pi}(s)$ can be solved by $q_{\pi}(s,a)$.
Bellman expectation equation for $q^{\pi}$

\[
q_{\pi}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} v_{\pi}(s')}
\]

  • $R_{s}^{a}$ is the expected reward received when action $a$ is taken in state $s$.
  • Conclusion: $q_{\pi}(s,a)$ can be solved by $v_{\pi}(s')$.
Bellman expectation equation for $v^{\pi}$ (2)

\[
v_{\pi}(s)=\sum_{a\in A}{\pi(a|s) \left( R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} v_{\pi}(s')} \right)}
\]

  • Conclusion: $v_{\pi}(s)$ can be solved by $v_{\pi}(s')$.
Bellman expectation equation for $q^{\pi}$ (2)

\[
q_{\pi}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} \sum_{a' \in A}{\pi(a'|s') q_{\pi}(s',a')}}
\]

  • Conclusion: $q_{\pi}(s,a)$ can be solved by $q_{\pi}(s',a')$.
Bellman expectation equation (matrix form)

\[
v_{\pi} = R^{\pi} + \gamma P^{\pi} v_{\pi}
\]

\[
v_{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi}
\]
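
A sketch of evaluating a fixed policy by reducing the MDP to an MRP: average $R_{s}^{a}$ and $P_{ss'}^{a}$ over $\pi(a|s)$ to get $R^{\pi}$ and $P^{\pi}$, then solve the linear system. All numbers below are made-up placeholders:

```python
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9

# Hypothetical MDP: P[a, s, s'] and R[s, a].
P = np.array([[[0.8, 0.2], [0.0, 1.0]],    # dynamics under action 0
              [[0.3, 0.7], [0.0, 1.0]]])   # dynamics under action 1
R = np.array([[1.0, 2.0],
              [0.0, 0.0]])
pi = np.array([[0.5, 0.5],                 # pi[s, a], rows sum to one
               [1.0, 0.0]])

# R^pi_s = sum_a pi(a|s) R^a_s  and  P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
R_pi = (pi * R).sum(axis=1)
P_pi = np.einsum('sa,ast->st', pi, P)

# v_pi = (I - gamma P^pi)^{-1} R^pi, solved as a linear system
v_pi = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)

# q_pi(s,a) = R^a_s + gamma * sum_s' P^a_{ss'} v_pi(s')
q_pi = R + gamma * np.einsum('ast,t->sa', P, v_pi)
print(v_pi)
print(q_pi)
```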

3.5. Optimal value function
  • The optimal value function specifies the best possible performance in the MDP.
  • An MDP is 'solved' when we know the optimal value function.
Optimal state-value function

The optimal state-value function $v_{*}(s)$ is the maximum state-value function over all policies.

\[
v_{*}(s) = \max_{\pi}{v_{\pi}(s)}
\]

Optimal action-value function

The optimal action-value function $q_{*}(s,a)$ is the maximum action-value function over all policies.

\[
q_{*}(s,a) = \max_{\pi} q_{\pi}(s,a)
\]

Partial ordering over policies

\[
\pi \geq \pi' \text{ if } v_{\pi}(s) \geq v_{\pi'}(s), \forall{s}
\]

Theorems

For any Markov decision process

  • [★] There exists an optimal policy $\pi_{*}$ that is better than or equal to all other policies, i.e., $\pi_{*} \geq \pi, \forall{\pi}$.
  • All optimal policies achieve the optimal value function.
  • All optimal policies achieve the optimal action-value function.
Finding an optimal policy

An optimal policy can be constructed by the following greedy rule.

\[
\pi_{*}(a|s)=
\begin{cases}
1 & \text{ if } a = \underset{a \in A}{\arg\max} \ {q_{*}(s,a)} \\
0 & \text{ otherwise }
\end{cases}
\]
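
Given $q_{*}$ as a table, constructing this policy is just an argmax over actions in each state. A sketch with placeholder q-values:

```python
import numpy as np

# Placeholder optimal action-values q_*[s, a] (made-up numbers).
q_star = np.array([[1.0, 2.5],
                   [0.3, 0.1]])

greedy_actions = q_star.argmax(axis=1)              # best action in each state
pi_star = np.eye(q_star.shape[1])[greedy_actions]   # one-hot rows: pi_*(a|s)
print(greedy_actions)   # [1 0]
print(pi_star)          # [[0. 1.]
                        #  [1. 0.]]
```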

3.6. Bellman optimality equation
Bellman optimality equation for $v_{*}$

\[
v_{*}(s)=\underset{a}\max q_{*}(s,a)
\]

Bellman optimality equation for $q_{*}$

\[
q_{*}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} v_{*}(s')}
\]

Bellman optimality equation for $v_{*}$ (2)

\[
v_{*}(s)=\underset{a}\max
\left[
R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} v_{*}(s')}
\right]
\]

Bellman optimality equation for $q_{*}$ (2)

\[
q_{*}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} \underset{a'}{\max}\, q_{*}(s',a')}
\]
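
One way to use these equations is to repeatedly apply the $v_{*}$ Bellman optimality backup as an update until the values stop changing, then read off a greedy policy (value iteration). A purely illustrative sketch on the same made-up MDP layout as the policy-evaluation example above:

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: P[a, s, s'] and R[s, a] (made-up numbers).
P = np.array([[[0.8, 0.2], [0.0, 1.0]],
              [[0.3, 0.7], [0.0, 1.0]]])
R = np.array([[1.0, 2.0],
              [0.0, 0.0]])

v = np.zeros(2)
for _ in range(1000):
    # q(s,a) = R^a_s + gamma * sum_s' P^a_{ss'} v(s')
    q = R + gamma * np.einsum('ast,t->sa', P, v)
    v_new = q.max(axis=1)            # v(s) = max_a q(s,a)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

print(v)                  # approximate v_*
print(q.argmax(axis=1))   # a greedy (optimal) action for each state
```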


4. Extensions to MDPs

[Our formulation] → [Extension]

  • finite states → infinite states
  • discrete states → continuous states
  • fully observable states → partially observable states
  • discounted future reward → undiscounted future reward
  • maximum future reward → averaged future reward
