Lecture 2: Markov Decision Processes | Reinforcement Learning | David Silver

1. Markov Process / Markov chain

1.1. Markov process

A Markov process or Markov chain is a tuple $\langle S,P \rangle$ such that

  • $S$ is a finite set of states, and
  • $P$ is a transition probability matrix.

In a Markov process, the initial state is assumed to be given; how the initial state is chosen is not part of the Markov process itself.

1.2. State transition probability

A state transition probability from $s$ to $s'$ is

\[
P_{ss'}=\mathbb{P}[S_{t+1}=s' | S_{t}=s]
\]

Episode

An episode of a Markov process is a trajectory of states sampled according to the transition probabilities.
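
Sampling an episode only requires repeatedly drawing the next state from the row of $P$ for the current state. A minimal sketch in Python (the states, transition matrix, and terminal state below are made-up examples, not taken from the lecture):

```python
import numpy as np

# Hypothetical 3-state Markov chain; "Sleep" acts as a terminal (absorbing) state.
states = ["Class", "Pub", "Sleep"]
P = np.array([[0.5, 0.2, 0.3],   # P[i, j] = P(S_{t+1} = j | S_t = i)
              [0.4, 0.4, 0.2],
              [0.0, 0.0, 1.0]])

def sample_episode(start=0, max_steps=20, rng=np.random.default_rng(0)):
    """Sample a trajectory of states (an episode) from the chain."""
    s, episode = start, [start]
    for _ in range(max_steps):
        s = rng.choice(len(states), p=P[s])   # next state ~ the row P[s]
        episode.append(s)
        if states[s] == "Sleep":              # stop once the absorbing state is reached
            break
    return [states[i] for i in episode]

print(sample_episode())   # e.g. ['Class', 'Pub', ..., 'Sleep']
```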


2. Markov reward process

2.1. Markov reward process (MRP)

A Markov reward process is a tuple $\langle S,P, R, \gamma \rangle$ such that

  • $S$ is a finite set of states,
  • $P$ is a transition probability matrix,
  • $R$ is a reward function, and
    • A reward function on state $s$ is $R_{s}=\mathbb{E}[r_{t+1}|S_{t}=s]$.
    • $r_{t}$ is a reward at time $t$.
  • $\gamma$ is a discount factor, $\gamma \in [0,1]$.
The steps of an MRP [sungjae]
  1. Be in a state.
  2. Move to a new state.
  3. Gain a reward on arriving at the new state.
  4. Go to step 1.

2.2. Return

The return at time-step $t$, denoted $G_{t}$, is the total discounted reward from time-step $t$.

\[
G_{t}=r_{t+1}+\gamma r_{t+2}+\gamma^{2} r_{t+3}+\cdots = \sum_{k=0}^{\infty}{\gamma^{k}r_{t+1+k}}
\]
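
As a quick numerical check of the formula (the reward sequence and $\gamma$ below are arbitrary examples), the return can be accumulated from the end of the sequence:

```python
def discounted_return(rewards, gamma):
    """G_t = r_{t+1} + gamma * r_{t+2} + ... for a finite reward sequence."""
    g = 0.0
    for r in reversed(rewards):   # accumulate backwards: G = r + gamma * G
        g = r + gamma * g
    return g

# Example: r_{t+1}, ..., r_{t+4} = -2, -2, 10, 0 and gamma = 0.9
print(discounted_return([-2, -2, 10, 0], 0.9))   # -2 + 0.9*(-2) + 0.81*10 + 0 = 4.3
```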

2.3. Value function in MRP
State value function

The state value function $v(s)$ of a Markov reward process is the expected return starting from state $s$.

\[
v(s)=\mathbb{E}[G_{t}|S_{t}=s]
\]
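
Because $v(s)$ is an expectation, it can be estimated by averaging sampled returns over many episodes that start in $s$. A sketch with a made-up 2-state MRP (state 1 is absorbing with zero reward; the numbers are not from the lecture):

```python
import numpy as np

# Hypothetical 2-state MRP.
P = np.array([[0.6, 0.4],    # transition matrix
              [0.0, 1.0]])
R = np.array([1.0, 0.0])     # R_s = E[r_{t+1} | S_t = s]
gamma = 0.9
rng = np.random.default_rng(0)

def sampled_return(s, max_steps=200):
    """One sampled return G_t starting from state s."""
    g, discount = 0.0, 1.0
    for _ in range(max_steps):
        g += discount * R[s]          # reward received for being in state s
        discount *= gamma
        s = rng.choice(2, p=P[s])
        if s == 1:                    # absorbing state: all later rewards are zero
            break
    return g

# Monte-Carlo estimate of v(0); the exact value is 1 / (1 - 0.9 * 0.6) ≈ 2.17.
print(np.mean([sampled_return(0) for _ in range(5000)]))
```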

2.4. Bellman equation for MRP
Bellman equation for Markov reward processes

The Bellman equation for MRPs states that $v(s)$ can be expressed in terms of the values $v(s')$ of its successor states.

\[
v(s)=R_{s}+\gamma\sum_{s'\in S}{P_{ss'}v(s')}
\]

Bellman equation in matrix form

\[

\begin{bmatrix}
v(1) \\ \vdots \\ v(n)
\end{bmatrix}
=
\begin{bmatrix}
R_{1} \\ \vdots \\ R_{n}
\end{bmatrix}

+
\gamma

\begin{bmatrix}
P_{11} & \cdots & P_{1n} \\
\vdots & \ddots & \vdots \\
P_{n1} & \cdots & P_{nn}
\end{bmatrix}

\begin{bmatrix}
v(1) \\ \vdots \\ v(n)
\end{bmatrix}

\]

\[
v=R+\gamma P v
\]

Solving the Bellman equation

\[
v=R+\gamma P v
\]

\[
(I-\gamma P)v=R
\]

\[
v=(I-\gamma P)^{-1}R
\]

  • The computational complexity of solving directly for $v$ is $O(n^3)$ for $n$ states.
  • This analytic method is only practical when $n$ is small (a small worked sketch of the direct solve follows this list).
  • If $n$ is large, the following iterative methods are used to solve MRPs instead:
    1. Dynamic programming
    2. Monte-Carlo evaluation
    3. Temporal-Difference learning
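
A sketch of the direct solve with NumPy, reusing the same made-up 2-state MRP as in the Monte-Carlo sketch above (any small $P$, $R$, $\gamma$ would do):

```python
import numpy as np

# Made-up 2-state MRP (state 1 is absorbing with zero reward).
P = np.array([[0.6, 0.4],
              [0.0, 1.0]])
R = np.array([1.0, 0.0])
gamma = 0.9

# Solve (I - gamma * P) v = R instead of forming the inverse explicitly.
v = np.linalg.solve(np.eye(2) - gamma * P, R)
print(v)   # approximately [2.174, 0.0]
```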

3. Markov decision process

3.1. Markov decision process (MDP)

A Markov decision process (MDP) is a Markov reward process with decisions (actions).

A Markov decision process is a tuple $\langle S,A, P, R, \gamma \rangle$ such that

  • $S$ is a finite set of states,
  • $A$ is a finite set of actions,
  • $P$ is a transition probability matrix,
    • $P_{ss'}^{a}=\mathbb{P}[S_{t+1}=s' | S_{t}=s, A_{t}=a]$
  • $R$ is a reward function, and
    • The reward function for taking action $a$ in state $s$ is $R_{s}^{a}=\mathbb{E}[r_{t+1} | S_{t}=s, A_{t}=a]$.
    • $r_{t}$ is a reward at time $t$.
  • $\gamma$ is a discount factor, $\gamma \in [0,1]$.
The steps of an MDP [sungjae]
  1. Be in a state.
  2. Take an action.
  3. Gain a reward for the action.
  4. Arrive at a new state.
  5. Go to step 1.

3.2. Policy

A policy $\pi$ is a distribution over actions $a$ given states $s$.

\[
\pi (a|s) = \mathbb{P}[A_{t}=a | S_{t}=s]
\]

  • A policy defines the behavior of an agent.
  • MDP policies depend only on the current state.
  • MDP policies are parameterized by states.
  • Policies are stationary (time-independent).
    • $A_{t}\sim\pi(\cdot | S_{t}), \forall{t} > 0$
    • A stationary policy does not change over time, i.e., it is not dynamic.
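
A stochastic policy can be stored as a table $\pi[s, a]$ whose rows sum to one; acting with it just means sampling an action from the row of the current state. A minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical policy over 2 states and 2 actions: pi[s, a] = P(A_t = a | S_t = s).
pi = np.array([[0.7, 0.3],
               [0.1, 0.9]])
rng = np.random.default_rng(0)

def act(s):
    """Sample A_t ~ pi(. | S_t = s); the rule depends only on the current state."""
    return int(rng.choice(pi.shape[1], p=pi[s]))

print([act(0) for _ in range(5)])   # e.g. [0, 1, 0, 0, 0]
```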
3.3. Value functions in Markov decision processes

There are two value functions in MDPs: [1] the state-value function and [2] the action-value function.

State-value function

The state-value function $v_{\pi}(s)$ of an MDP is the expected return [1] starting from state $s$ and [2] following policy $\pi$.

\[
v_{\pi}(s)=\mathbb{E}_{\pi}[G_{t}|S_{t}=s]
\]

Action-value function

The action-value function $q_{\pi}(s,a)$ is the expected return [1] starting from state $s$, [2] taking action $a$, and then [3] following policy $\pi$.

\[
q_{\pi}(s,a)=\mathbb{E}_{\pi}[G_{t} | S_{t} = s , A_{t} = a]
\]

3.4. Bellman expectation equation in MDP
Bellman expectation equation for $v^{\pi}$

\[
v_{\pi}(s)=\sum_{a\in A}{\pi(a|s)q_{\pi}(s,a)}
\]

  • $s$ is the current state.
  • $A$ is the set of actions that can be taken in state $s$.
  • $a\in A$ is an action available in $s$.
  • $\pi$ is a particular policy.
  • Conclusion: $v_{\pi}(s)$ can be solved by $q_{\pi}(s,a)$.
Bellman expectation equation for $q^{\pi}$

\[
q_{\pi}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} v_{\pi}(s')}
\]

  • $R_{s}^{a}$ is the expected reward received when action $a$ is taken in state $s$.
  • Conclusion: $q_{\pi}(s,a)$ can be solved by $v_{\pi}(s')$.
Bellman expectation equation for $v^{\pi}$ (2)

\[
v_{\pi}(s)=\sum_{a\in A}{\pi(a|s) \left( R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} v_{\pi}(s')} \right)}
\]

  • Conclusion: $v_{\pi}(s)$ can be solved by $v_{\pi}(s')$.
Bellman expectation equation for $q^{\pi}$ (2)

\[
q_{\pi}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} \sum_{a' \in A}{\pi(a'|s') q_{\pi}(s',a')}}
\]

  • Conclusion: $q_{\pi}(s,a)$ can be solved by $q_{\pi}(s',a')$.
Bellman expectation equation (matrix form)

\[
v_{\pi} = R^{\pi} + \gamma P^{\pi} v_{\pi}
\]

\[
v_{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi}
\]
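
A sketch of evaluating a fixed policy by reducing the MDP to an MRP: average $R_{s}^{a}$ and $P_{ss'}^{a}$ over $\pi(a|s)$ to get $R^{\pi}$ and $P^{\pi}$, then solve the linear system. All numbers below are made-up placeholders:

```python
import numpy as np

n_s, n_a, gamma = 2, 2, 0.9

# Hypothetical MDP: P[a, s, s'] and R[s, a].
P = np.array([[[0.8, 0.2], [0.0, 1.0]],    # dynamics under action 0
              [[0.3, 0.7], [0.0, 1.0]]])   # dynamics under action 1
R = np.array([[1.0, 2.0],
              [0.0, 0.0]])
pi = np.array([[0.5, 0.5],                 # pi[s, a], rows sum to one
               [1.0, 0.0]])

# R^pi_s = sum_a pi(a|s) R^a_s  and  P^pi_{ss'} = sum_a pi(a|s) P^a_{ss'}
R_pi = (pi * R).sum(axis=1)
P_pi = np.einsum('sa,ast->st', pi, P)

# v_pi = (I - gamma P^pi)^{-1} R^pi, solved as a linear system
v_pi = np.linalg.solve(np.eye(n_s) - gamma * P_pi, R_pi)

# q_pi(s,a) = R^a_s + gamma * sum_s' P^a_{ss'} v_pi(s')
q_pi = R + gamma * np.einsum('ast,t->sa', P, v_pi)
print(v_pi)
print(q_pi)
```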

3.5. Optimal value function
  • The optimal value function specifies the best possible performance in the MDP.
  • An MDP is 'solved' when we know the optimal value function.
Optimal state-value function

The optimal state-value function $v_{*}(s)$ is the maximum state-value function over all policies.

\[
v_{*}(s) = \max_{\pi}{v_{\pi}(s)}
\]

Optimal action-value function

The optimal action-value function $q_{*}(s,a)$ is the maximum action-value function over all policies.

\[
q_{*}(s,a) = \max_{\pi} q_{\pi}(s,a)
\]

Partial ordering over policies

\[
\pi \geq \pi' \text{ if } v_{\pi}(s) \geq v_{\pi'}(s), \forall{s}
\]

Theorems

For any Markov decision process

  • [★] There exists an optimal policy $\pi_{*}$ that is better than or equal to all other policies, i.e., $\pi_{*} \geq \pi, \forall{\pi}$.
  • All optimal policies achieve the optimal value function.
  • All optimal policies achieve the optimal action-value function.
Finding an optimal policy

An optimal policy can be constructed by the following greedy rule.

\[
\pi_{*}(a|s)=
\begin{cases}
1 & \text{ if } a = \underset{a \in A}{\arg\max} \ {q_{*}(s,a)} \\
0 & \text{ otherwise }
\end{cases}
\]
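
Given $q_{*}$ as a table, constructing this policy is just an argmax over actions in each state. A sketch with placeholder q-values:

```python
import numpy as np

# Placeholder optimal action-values q_*[s, a] (made-up numbers).
q_star = np.array([[1.0, 2.5],
                   [0.3, 0.1]])

greedy_actions = q_star.argmax(axis=1)              # best action in each state
pi_star = np.eye(q_star.shape[1])[greedy_actions]   # one-hot rows: pi_*(a|s)
print(greedy_actions)   # [1 0]
print(pi_star)          # [[0. 1.]
                        #  [1. 0.]]
```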

3.6. Bellman optimality equation
Bellman optimality equation for $v_{*}$

\[
v_{*}(s)=\underset{a}\max q_{*}(s,a)
\]

Bellman optimality equation for $q_{*}$

\[
q_{*}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} v_{*}(s')}
\]

Bellman optimality equation for $v_{*}$ (2)

\[
v_{*}(s)=\underset{a}\max
\left[
R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} v_{*}(s')}
\right]
\]

Bellman optimality equation for $q_{*}$ (2)

\[
q_{*}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} \underset{a'}{\max}\, q_{*}(s',a')}
\]
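
One way to use these equations is to repeatedly apply the $v_{*}$ Bellman optimality backup as an update until the values stop changing, then read off a greedy policy (value iteration). A purely illustrative sketch on the same made-up MDP layout as the policy-evaluation example above:

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: P[a, s, s'] and R[s, a] (made-up numbers).
P = np.array([[[0.8, 0.2], [0.0, 1.0]],
              [[0.3, 0.7], [0.0, 1.0]]])
R = np.array([[1.0, 2.0],
              [0.0, 0.0]])

v = np.zeros(2)
for _ in range(1000):
    # q(s,a) = R^a_s + gamma * sum_s' P^a_{ss'} v(s')
    q = R + gamma * np.einsum('ast,t->sa', P, v)
    v_new = q.max(axis=1)            # v(s) = max_a q(s,a)
    if np.max(np.abs(v_new - v)) < 1e-8:
        break
    v = v_new

print(v)                  # approximate v_*
print(q.argmax(axis=1))   # a greedy (optimal) action for each state
```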


4. Extensions to MDPs

[Our formulation] → [Extension]

  • finite states → infinite states
  • discrete states → continuous states
  • fully observable states → partially observable states
  • discounted future reward → undiscounted future reward
  • maximum future reward → averaged future reward
