Lecture 2: Markov Decision Processes | Reinforcement Learning | David Silver

1. Markov Process / Markov chain

1.1. Markov process

A Markov process (or Markov chain) is a tuple $\langle \mathcal{S}, \mathcal{P} \rangle$ such that

• $\mathcal{S}$ is a finite set of states, and
• $\mathcal{P}$ is a state transition probability matrix.

In a Markov process, the initial state must be given; how the initial state is chosen is not the role of the Markov process itself.

1.2. State transition probability

The state transition probability from state $s$ to successor state $s'$ is

$$\mathcal{P}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s]$$

Episode

An episode of a Markov process is a sampled trajectory of states, e.g., $S_1, S_2, \ldots, S_T$.
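As an illustration, here is a minimal Python sketch of sampling an episode from a Markov chain; the state names and transition matrix below are hypothetical, not from the lecture:

```python
import numpy as np

# Hypothetical 3-state Markov chain (illustrative numbers only).
states = ["Class", "Pub", "Sleep"]        # "Sleep" is an absorbing state
P = np.array([[0.5, 0.3, 0.2],            # transition probabilities from "Class"
              [0.4, 0.4, 0.2],            # from "Pub"
              [0.0, 0.0, 1.0]])           # from "Sleep" (absorbing)

def sample_episode(start=0, max_steps=20, seed=0):
    """Sample a trajectory S_1, S_2, ... by repeatedly drawing the next state from P."""
    rng = np.random.default_rng(seed)
    s, episode = start, [start]
    for _ in range(max_steps):
        s = rng.choice(len(states), p=P[s])
        episode.append(s)
        if states[s] == "Sleep":          # terminate at the absorbing state
            break
    return [states[i] for i in episode]

print(sample_episode())                   # e.g. ['Class', 'Class', 'Pub', ..., 'Sleep']
```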

2. Markov reward process

2.1. Markov reward process (MRP)

A Markov reward process is a tuple $\langle \mathcal{S}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ such that

• $\mathcal{S}$ is a finite set of states,
• $\mathcal{P}$ is a state transition probability matrix,
• $\mathcal{R}$ is a reward function; the reward function on state $s$ is $\mathcal{R}_s = \mathbb{E}[R_{t+1} \mid S_t = s]$, where $R_t$ is the reward at time $t$, and
• $\gamma$ is a discount factor, $\gamma \in [0, 1]$.
The steps of an MRP [sungjae]
1. Be in a state.
2. Move to a new state.
3. Gain a reward on arriving at the new state.
4. Go to step 1.
2.2. Return

The return at time-step $t$ is denoted by $G_t$.
$G_t$ is the total discounted reward from time-step $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
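For a finite (terminated) episode the return is a truncated version of this sum; a minimal sketch with hypothetical rewards:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite (terminated) episode."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# Hypothetical rewards R_{t+1}, R_{t+2}, R_{t+3} = 1, 2, 4 with gamma = 0.5:
print(discounted_return([1, 2, 4], gamma=0.5))   # 1 + 0.5*2 + 0.25*4 = 3.0
```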

2.3. Value function in MRP
State value function

The state value function $v(s)$ of a Markov reward process is the expected return starting from state $s$:

$$v(s) = \mathbb{E}[G_t \mid S_t = s]$$

2.4. Bellman equation for MRP
Bellman equation for Markov reward processes

$$v(s) = \mathbb{E}[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s] = \mathcal{R}_s + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}\, v(s')$$

The Bellman equation for MRPs implies that $v(s)$ can be represented by the multiple successor values $v(s')$.

Solving the Bellman equation

• In matrix form the Bellman equation is $v = \mathcal{R} + \gamma \mathcal{P} v$, so it can be solved directly: $v = (I - \gamma \mathcal{P})^{-1} \mathcal{R}$ (see the numpy sketch below).
• The computational complexity of this direct solution is $O(n^3)$ for $n$ states.
• This analytic method is only feasible if $n$ is small.
• If $n$ is large, the following iterative methods are used to solve MRPs.
1. Dynamic programming
2. Monte-Carlo evaluation
3. Temporal-Difference learning
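A minimal numpy sketch of the direct solution above, using a hypothetical 3-state MRP (all numbers illustrative):

```python
import numpy as np

# Hypothetical 3-state MRP (illustrative numbers only).
P = np.array([[0.5, 0.3, 0.2],
              [0.4, 0.4, 0.2],
              [0.0, 0.0, 1.0]])       # row-stochastic transition matrix
R = np.array([-2.0, 1.0, 0.0])        # expected reward R_s for each state
gamma = 0.9

# v = (I - gamma * P)^{-1} R, computed as a linear solve rather than
# forming the inverse explicitly (cheaper and numerically safer).
v = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print(v)                              # value of each state under this MRP
```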

3. Markov decision process

3.1. Markov decision process (MDP)

A Markov decision process (MDP) is a Markov reward process with decisions (actions).

A Markov decision process is a tuple $\langle \mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma \rangle$ such that

• $\mathcal{S}$ is a finite set of states,
• $\mathcal{A}$ is a finite set of actions,
• $\mathcal{P}$ is a transition probability matrix, $\mathcal{P}^{a}_{ss'} = \mathbb{P}[S_{t+1} = s' \mid S_t = s, A_t = a]$,
• $\mathcal{R}$ is a reward function, $\mathcal{R}^{a}_{s} = \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a]$, where $R_t$ is the reward at time $t$, and
• $\gamma$ is a discount factor, $\gamma \in [0, 1]$.
The steps of an MDP [sungjae]
1. Be in a state.
2. Take an action.
3. Gain a reward for the action.
4. Arrive at a new state.
5. Go to step 1.
3.2. Policy

A policy $\pi$ is a distribution over actions given states:

$$\pi(a \mid s) = \mathbb{P}[A_t = a \mid S_t = s]$$

• A policy defines the action [behavior] of an agent.
• MDP policies depend only on the current state.
• MDP policies are parameterized by states.
• Policies are stationary [time-independent], as opposed to dynamic [time-varying]: $A_t \sim \pi(\cdot \mid S_t)$ for all $t > 0$ (see the sampling sketch below).
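As a small illustration, a sketch of sampling an action from a stochastic tabular policy (the probability table is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical policy pi(a|s), stored as an |S| x |A| table of probabilities;
# each row is a distribution over actions for one state.
pi = np.array([[0.7, 0.3],
               [0.1, 0.9]])

def sample_action(s):
    """Draw A_t ~ pi(.|S_t = s); stationary, so it depends only on the current state."""
    return rng.choice(pi.shape[1], p=pi[s])

print(sample_action(0))   # 0 with probability 0.7, 1 with probability 0.3
```
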
3.3. Value functions in Markov decision processes

There are two value functions in MDPs: [1] the state-value function and [2] the action-value function.

State-value function

The state-value function $v_\pi(s)$ of an MDP is the expected return [1] starting from state $s$ and [2] following policy $\pi$:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

Action-value function

The action-value function $q_\pi(s, a)$ is the expected return [1] starting from state $s$, [2] taking action $a$, and then [3] following policy $\pi$:

$$q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

3.4. Bellman expectation equation in MDP
Bellman expectation equation for $v_\pi$

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s)\, q_\pi(s, a)$$

• $s$ is the current state.
• $\mathcal{A}$ is the set of the actions that can be taken in $s$.
• $a$ is an action in $s$.
• $\pi$ is a particular policy.
• Conclusion: $v_\pi(s)$ can be expressed in terms of $q_\pi(s, a)$.
Bellman expectation equation for $q_\pi$

$$q_\pi(s, a) = \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'}\, v_\pi(s')$$

• $\mathcal{R}^{a}_{s}$ is the reward function that returns the expected reward when action $a$ is taken in state $s$.
• Conclusion: $q_\pi(s, a)$ can be expressed in terms of $v_\pi(s')$.
Bellman expectation equation for $v_\pi$ (2)

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s) \left( \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'}\, v_\pi(s') \right)$$

• Conclusion: $v_\pi(s)$ can be expressed in terms of $v_\pi(s')$.
Bellman expectation equation for $q_\pi$ (2)

$$q_\pi(s, a) = \mathcal{R}^{a}_{s} + \gamma \sum_{s' \in \mathcal{S}} \mathcal{P}^{a}_{ss'} \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, q_\pi(s', a')$$

• Conclusion: $q_\pi(s, a)$ can be expressed in terms of $q_\pi(s', a')$ (an iterative evaluation sketch of these equations follows below).
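The last two forms suggest an iterative scheme: start from an arbitrary $v$ and repeatedly apply the right-hand side as an update. A minimal tabular sketch, assuming hypothetical array shapes for $\mathcal{P}$, $\mathcal{R}$, and $\pi$:

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma, iters=500):
    """Iterative policy evaluation via the Bellman expectation equation.

    P  : transitions, shape (|A|, |S|, |S|), P[a, s, s2] = P(s2 | s, a)
    R  : rewards,     shape (|S|, |A|),      R[s, a] = expected reward for (s, a)
    pi : policy,      shape (|S|, |A|),      pi[s, a] = pi(a | s)
    """
    v = np.zeros(R.shape[0])
    for _ in range(iters):
        # q(s, a) = R[s, a] + gamma * sum_s2 P[a, s, s2] * v(s2)
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        # v(s) = sum_a pi(a | s) * q(s, a)
        v = (pi * q).sum(axis=1)
    return v
```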

3.5. Optimal value function
• The optimal value function specifies the best possible performance in the MDP.
• An MDP is 'solved' when we know the optimal value function.
Optimal state-value function

The optimal state-value function $v_*(s)$ is the maximum state-value function over all policies:

$$v_*(s) = \max_{\pi} v_\pi(s)$$

Optimal action-value function

The optimal action-value function $q_*(s, a)$ is the maximum action-value function over all policies:

$$q_*(s, a) = \max_{\pi} q_\pi(s, a)$$

Theorems

For any Markov decision process,

• [★] There exists an optimal policy $\pi_*$ that is better than or equal to all other policies, i.e., $\pi_* \geq \pi, \forall \pi$.
• All optimal policies achieve the optimal state-value function: $v_{\pi_*}(s) = v_*(s)$.
• All optimal policies achieve the optimal action-value function: $q_{\pi_*}(s, a) = q_*(s, a)$.
Finding an optimal policy

An optimal policy can be constructed by maximizing over the optimal action-value function:

$$\pi_*(a \mid s) = \begin{cases} 1 & \text{if } a = \operatorname{argmax}_{a' \in \mathcal{A}} q_*(s, a') \\ 0 & \text{otherwise} \end{cases}$$
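A minimal sketch of reading off this deterministic policy from a known $q_*$ table (the numbers are hypothetical):

```python
import numpy as np

# Hypothetical q_*(s, a) table, shape (|S|, |A|).
q_star = np.array([[1.0, 3.0],
                   [2.5, 0.5]])

# pi_*(a|s) = 1 for a = argmax_a q_*(s, a), 0 otherwise.
pi_star = np.zeros_like(q_star)
pi_star[np.arange(q_star.shape[0]), q_star.argmax(axis=1)] = 1.0
print(pi_star)    # [[0. 1.]
                  #  [1. 0.]]
```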

4. Extensions to MDPs

[Our formulation] → [Extension]

• finite states → infinite states
• discrete states → continuous states
• fully observable states → partially observable states
• discounted future reward → undiscounted future reward
• maximum future reward → average future reward