1. Markov Process / Markov chain
1.1. Markov process
A Markov process, or Markov chain, is a tuple $\langle S,P \rangle$ such that
- $S$ is a finite set of states, and
- $P$ is a transition probability matrix.
In a Markov process, the initial state must be given; how the initial state is chosen is not part of the Markov process itself.
1.2. State transition probability
The state transition probability from $s$ to $s'$ is
\[
P_{ss'}=\mathbb{P}[S_{t+1}=s' | S_{t}=s]
\]
Episode
An episode of a Markov process is a trajectory of states visited through the transitions.
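The definitions above can be sketched in code. A minimal example, assuming a made-up two-state weather chain (the states and the transition matrix $P$ are illustrative assumptions, not from the notes):

```python
import random

# Toy transition matrix P_{ss'}: each row is a distribution over next states.
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def sample_episode(start, steps, rng=random.Random(0)):
    """Sample an episode (trajectory of states) of `steps` transitions."""
    state, episode = start, [start]
    for _ in range(steps):
        successors = list(P[state])
        weights = [P[state][s] for s in successors]
        state = rng.choices(successors, weights=weights)[0]
        episode.append(state)
    return episode

print(sample_episode("sunny", 5))
```

The initial state is passed in from outside, matching the remark that choosing it is not the Markov process's job.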
2. Markov reward process
2.1. Markov reward process (MRP)
A Markov reward process is a tuple $\langle S,P, R, \gamma \rangle$ such that
- $S$ is a finite set of states,
- $P$ is a transition probability matrix,
- $R$ is a reward function, and
- A reward function on state $s$ is $R_{s}=\mathbb{E}[r_{t+1}|S_{t}=s]$.
- $r_{t}$ is a reward at time $t$.
- $\gamma$ is a discount factor, $\gamma \in [0,1]$.
The steps of an MRP [sungjae]
- Be on a state.
- Move to a new state.
- Gain a reward on arriving at the state.
- Go to step 1
2.2. Return
The return at time-step $t$ is denoted by $G_{t}$.
$G_{t}$ is the total discounted reward from time-step $t$.
\[
G_{t}=r_{t+1}+\gamma r_{t+2}+\cdots = \sum_{k=0}^{\infty}{\gamma^{k}r_{t+1+k}}
\]
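The return formula can be computed by folding the reward sequence backwards, since $G_{t}=r_{t+1}+\gamma G_{t+1}$. A minimal sketch for a finite reward sequence (the rewards and $\gamma$ are made-up example values):

```python
def discounted_return(rewards, gamma):
    """G_t = r_{t+1} + gamma*r_{t+2} + ... for a finite reward sequence.

    Folds from the back using the recursion G_t = r_{t+1} + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1 + 0.5 + 0.25 = 1.75
```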
2.3. Value function in MRP
State value function
The state value function $v(s)$ of a Markov reward process is the expected return starting from state $s$.
\[
v(s)=\mathbb{E}[G_{t}|S_{t}=s]
\]
2.4. Bellman equation for MRP
Bellman equation for Markov reward processes
The Bellman equation for MRPs states that $v(s)$ can be expressed in terms of the values $v(s')$ of the successor states.
\[
v(s)=R_{s}+\gamma\sum_{s'\in S}{P_{ss'}v(s')}
\]
Bellman equation in matrix form
\[
\begin{bmatrix}
v(1) \\ \vdots \\ v(n)
\end{bmatrix}
=
\begin{bmatrix}
R_{1} \\ \vdots \\ R_{n}
\end{bmatrix}
+
\gamma
\begin{bmatrix}
P_{11} & \cdots & P_{1n} \\
\vdots & \ddots & \vdots \\
P_{n1} & \cdots & P_{nn}
\end{bmatrix}
\begin{bmatrix}
v(1) \\ \vdots \\ v(n)
\end{bmatrix}
\]
\[
v=R+\gamma P v
\]
Solving the Bellman equation
\[
v=R+\gamma P v
\]
\[
(I-\gamma P)v=R
\]
\[
v=(I-\gamma P)^{-1}R
\]
- The computational complexity to solve $v$ is $O(n^3)$ for $n$ states.
- This analytic method is only feasible if $n$ is small.
- If $n$ is large, the following iterative methods are used to solve MRPs.
- Dynamic programming
- Monte-Carlo evaluation
- Temporal-Difference learning
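The fixed-point nature of $v=R+\gamma P v$ can be shown with the simplest iterative method: repeatedly apply the Bellman backup until it converges (this is the dynamic-programming approach from the list above). The two-state MRP below ($R$, $P$, $\gamma$) is a made-up toy example:

```python
# Toy 2-state MRP: these R, P, gamma values are illustrative assumptions.
R = [1.0, -1.0]          # expected immediate rewards R_s
P = [[0.9, 0.1],         # transition matrix P_{ss'}
     [0.5, 0.5]]
gamma = 0.9

def solve_mrp(R, P, gamma, iters=1000):
    """Fixed-point iteration v <- R + gamma * P v (a contraction for gamma < 1)."""
    n = len(R)
    v = [0.0] * n
    for _ in range(iters):
        v = [R[s] + gamma * sum(P[s][t] * v[t] for t in range(n))
             for s in range(n)]
    return v

v = solve_mrp(R, P, gamma)
print(v)  # satisfies v = R + gamma * P v up to numerical tolerance
```

For small $n$ one could equally solve the linear system $(I-\gamma P)v=R$ directly; the iteration avoids the $O(n^3)$ inverse.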
3. Markov decision process
3.1. Markov decision process (MDP)
A Markov decision process (MDP) is a Markov reward process with decisions (actions).
A Markov decision process is a tuple $\langle S,A, P, R, \gamma \rangle$ such that
- $S$ is a finite set of states,
- $A$ is a finite set of actions,
- $P$ is a transition probability matrix,
- $P_{ss'}^{a}=\mathbb{P}[S_{t+1}=s' | S_{t}=s, A_{t}=a]$
- $R$ is a reward function, and
- The reward function for taking action $a$ in state $s$ is $R_{s}^{a}=\mathbb{E}[r_{t+1} | S_{t}=s, A_{t}=a]$.
- $r_{t}$ is a reward at time $t$.
- $\gamma$ is a discount factor, $\gamma \in [0,1]$.
The steps of an MDP [sungjae]
- Be on a state.
- Take an action.
- Gain a reward for the action.
- Arrive at a new state.
- Go to step 1
3.2. Policy
A policy $\pi$ is a distribution over actions $a$ given states $s$.
\[
\pi (a|s) = \mathbb{P}[A_{t}=a | S_{t}=s]
\]
- A policy defines the behavior of an agent.
- MDP policies depend only on the current state.
- MDP policies are parameterized by states.
- Policies are stationary (time-independent):
- $A_{t}\sim \pi(\cdot | S_{t}), \forall{t} > 0$
- Stationary is the opposite of dynamic.
3.3. Value functions in Markov decision processes
There are two value functions in MDPs: [1] the state-value function and [2] the action-value function.
State-value function
The state-value function $v_{\pi}(s)$ of an MDP is the expected return [1] starting from state $s$ and [2] following policy $\pi$.
\[
v_{\pi}(s)=\mathbb{E}_{\pi}[G_{t}|S_{t}=s]
\]
Action-value function
The action-value function $q_{\pi}(s,a)$ is the expected return [1] starting from state $s$, [2] taking action $a$, and then [3] following policy $\pi$.
\[
q_{\pi}(s,a)=\mathbb{E}_{\pi}[G_{t} | S_{t} = s , A_{t} = a]
\]
3.4. Bellman expectation equation in MDP
Bellman expectation equation for $v_{\pi}$
\[
v_{\pi}(s)=\sum_{a\in A}{\pi(a|s)q_{\pi}(s,a)}
\]
- $s$ is the current state.
- $A$ is the set of actions available in state $s$.
- $a\in A$ is an action on $s$.
- $\pi$ is a particular policy.
- Conclusion: $v_{\pi}(s)$ can be solved by $q_{\pi}(s,a)$.
Bellman expectation equation for $q_{\pi}$
\[
q_{\pi}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} v_{\pi}(s')}
\]
- $R_{s}^{a}$ is the expected immediate reward for taking action $a$ in state $s$.
- Conclusion: $q_{\pi}(s,a)$ can be solved by $v_{\pi}(s')$.
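This equation is a one-step lookahead: the immediate reward plus the discounted, probability-weighted value of the successor states. A minimal numeric sketch, where $R_{s}^{a}$, $P_{ss'}^{a}$, and the successor values $v_{\pi}(s')$ are all made-up example numbers:

```python
# One-step lookahead: q_pi(s,a) = R_s^a + gamma * sum_{s'} P_{ss'}^a * v_pi(s').
# All numbers below are illustrative assumptions for a fixed (s, a) pair.
gamma = 0.9
R_sa = 1.0               # R_s^a
P_a = [0.3, 0.7]         # P_{ss'}^a over two successor states
v_pi = [2.0, 4.0]        # assumed v_pi(s') for those successors

q_sa = R_sa + gamma * sum(p * v for p, v in zip(P_a, v_pi))
print(q_sa)  # 1.0 + 0.9 * (0.3*2 + 0.7*4) ≈ 4.06
```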
Bellman expectation equation for $v_{\pi}$ (2)
\[
v_{\pi}(s)=\sum_{a\in A}{\pi(a|s) \left( R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} v_{\pi}(s')} \right)}
\]
- Conclusion: $v_{\pi}(s)$ can be solved by $v_{\pi}(s')$.
Bellman expectation equation for $q_{\pi}$ (2)
\[
q_{\pi}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} \sum_{a'\in A}{\pi(a'|s')q_{\pi}(s',a')}}
\]
- Conclusion: $q_{\pi}(s,a)$ can be solved by $q_{\pi}(s',a')$.
Bellman expectation equation (matrix form)
\[
v_{\pi} = R^{\pi} + \gamma P^{\pi} v_{\pi}
\]
\[
v_{\pi} = (I - \gamma P^{\pi})^{-1} R^{\pi}
\]
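The matrix form can be evaluated iteratively just like the MRP case, by averaging the action-conditional backups under the policy. A minimal policy-evaluation sketch, where the two-state, two-action MDP ($P$, $R$) and the policy $\pi$ are made-up illustrative assumptions:

```python
# Toy 2-state, 2-action MDP; all numbers are illustrative assumptions.
gamma = 0.9
P = {"stay": [[1.0, 0.0], [0.0, 1.0]],   # P[a][s][s']
     "move": [[0.0, 1.0], [1.0, 0.0]]}
R = {"stay": [0.0, 1.0], "move": [0.5, 0.0]}   # R[a][s]
pi = {0: {"stay": 0.5, "move": 0.5},     # pi(a|s): a distribution per state
      1: {"stay": 1.0, "move": 0.0}}

def evaluate_policy(pi, P, R, gamma, iters=1000):
    """Iterate the Bellman expectation equation v <- R^pi + gamma * P^pi v."""
    n = len(pi)
    v = [0.0] * n
    for _ in range(iters):
        v = [sum(pi[s][a] * (R[a][s] + gamma * sum(P[a][s][t] * v[t]
                                                   for t in range(n)))
                 for a in pi[s])
             for s in range(n)]
    return v

v_pi = evaluate_policy(pi, P, R, gamma)
print(v_pi)
```

Here state 1 under "stay" earns reward 1 forever, so $v_{\pi}(1)$ converges to $1/(1-\gamma)=10$.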
3.5. Optimal value function
- The optimal value function specifies the best possible performance in the MDP.
- An MDP is 'solved' when we know the optimal value function.
Optimal state-value function
The optimal state-value function $v_{*}(s)$ is the maximum state-value function over all policies.
\[
v_{*}(s) = \max_{\pi}{v_{\pi}(s)}
\]
Optimal action-value function
The optimal action-value function $q_{*}(s,a)$ is the maximum action-value function over all policies.
\[
q_{*}(s,a) = \max_{\pi} q_{\pi}(s,a)
\]
Partial ordering over policies
\[
\pi \geq \pi' \textup{ if } v_{\pi}(s) \geq v_{\pi'}(s), \forall{s}
\]
Theorems
For any Markov decision process
- [★] There exists an optimal policy $\pi_{*}$ that is better than or equal to all other policies, i.e., $\pi_{*} \geq \pi, \forall{\pi}$.
- All optimal policies achieve the optimal value function.
- All optimal policies achieve the optimal action-value function.
Finding an optimal policy
An optimal policy can be constructed by the following formulation.
\[
\pi_{*}(a|s)=
\begin{cases}
1 & \text{ if } a = \underset{a \in A}{\arg\max} \ {q_{*}(s,a)} \\
0 & \text{ otherwise }
\end{cases}
\]
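The formulation above, in code: put all probability on the argmax action of $q_{*}$ in each state. A minimal sketch, where the $q_{*}$ table (states, actions, and values) is a made-up illustrative assumption:

```python
# Assumed q* table for a toy MDP; the numbers are illustrative only.
q_star = {("s0", "left"): 1.0, ("s0", "right"): 2.5,
          ("s1", "left"): 0.3, ("s1", "right"): 0.1}

def greedy_policy(q, states, actions):
    """pi*(a|s) = 1 for the argmax action of q(s, .), 0 otherwise."""
    pi = {}
    for s in states:
        best = max(actions, key=lambda a: q[(s, a)])
        for a in actions:
            pi[(s, a)] = 1.0 if a == best else 0.0
    return pi

pi_star = greedy_policy(q_star, ["s0", "s1"], ["left", "right"])
print(pi_star)  # all mass on "right" at s0 and on "left" at s1
```

This also shows why knowing $q_{*}$ is enough to act optimally: no model of $P$ or $R$ is needed at decision time.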
3.6. Bellman optimality equation
Bellman optimality equation for $v_{*}$
\[
v_{*}(s)=\underset{a}\max q_{*}(s,a)
\]
Bellman optimality equation for $q_{*}$
\[
q_{*}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} v_{*}(s')}
\]
Bellman optimality equation for $v_{*}$ (2)
\[
v_{*}(s)=\underset{a}\max
\left[
R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} v_{*}(s')}
\right]
\]
Bellman optimality equation for $q_{*}$ (2)
\[
q_{*}(s,a) = R_{s}^{a} + \gamma \sum_{s' \in S}{P_{ss'}^{a} \underset{a'}\max q_{*}(s',a')}
\]
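Iterating the Bellman optimality equation for $v_{*}$ (2) as an update rule gives value iteration, one of the standard ways to solve an MDP when the model is known. A minimal sketch on a toy two-state, two-action MDP whose $P$ and $R$ are made-up illustrative assumptions:

```python
# Toy 2-state, 2-action MDP; all numbers are illustrative assumptions.
gamma = 0.9
P = {"stay": [[1.0, 0.0], [0.0, 1.0]],   # P[a][s][s']
     "move": [[0.0, 1.0], [1.0, 0.0]]}
R = {"stay": [0.0, 1.0], "move": [0.5, 0.0]}   # R[a][s]

def value_iteration(P, R, gamma, iters=1000):
    """Iterate v(s) <- max_a [ R_s^a + gamma * sum_{s'} P_{ss'}^a v(s') ]."""
    n = 2
    v = [0.0] * n
    for _ in range(iters):
        v = [max(R[a][s] + gamma * sum(P[a][s][t] * v[t] for t in range(n))
                 for a in P)
             for s in range(n)]
    return v

v_star = value_iteration(P, R, gamma)
print(v_star)  # converges to approximately [9.5, 10.0]
```

At the fixed point, staying in state 1 yields $1/(1-\gamma)=10$, and the best choice in state 0 is to move there first, giving $0.5+\gamma\cdot 10=9.5$.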
4. Extensions to MDPs
[Our formulation] → [Extension]
- finite states → infinite states
- discrete states → continuous states
- fully observable states → partially observable states
- discounted future reward → undiscounted future reward
- maximum future reward → averaged future reward