Markov Decision Process


Introduction

A Markov Decision Process (MDP) is a mathematical framework used in artificial intelligence and operations research for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. MDPs are useful for studying optimization problems solved via dynamic programming and reinforcement learning.

A visual representation of a Markov Decision Process with states, actions, and transitions.

Definition

Formally, a Markov Decision Process is a 4-tuple (S, A, P, R), where:

- S is a finite set of states,
- A is a finite set of actions,
- P is a state transition probability matrix,
- R is a reward function.

State Transition Probability

The state transition probability, denoted as P, defines the probability of landing in a particular state given the current state and action. It is represented as P(s'|s, a), where s' is the new state, s is the current state, and a is the action taken.
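
Implicit in this definition is the requirement that, for every state s and action a, the transition probabilities over successor states form a valid probability distribution:

    Σ_{s'} P(s'|s, a) = 1,  with P(s'|s, a) ≥ 0 for every s' in S.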

Reward Function

The reward function R(s, a, s') gives the immediate reward received after transitioning from state s to state s', due to action a. The reward function can be deterministic or stochastic.
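
The components of the tuple are easy to write down concretely. Below is a minimal sketch in Python of a two-state, two-action MDP; the state and action names, probabilities, and reward values are purely illustrative and not taken from any standard example.

    # S and A: finite sets of states and actions (illustrative names only).
    S = ["low", "high"]
    A = ["wait", "recharge"]

    # P[s][a] maps each successor state s' to P(s'|s, a).
    P = {
        "low":  {"wait": {"low": 0.9, "high": 0.1},
                 "recharge": {"high": 1.0}},
        "high": {"wait": {"low": 0.3, "high": 0.7},
                 "recharge": {"high": 1.0}},
    }

    # R[(s, a, s')] is the immediate reward for that transition
    # (a deterministic reward function in this sketch).
    R = {
        ("low", "wait", "low"): 1.0,
        ("low", "wait", "high"): 1.0,
        ("low", "recharge", "high"): 0.0,
        ("high", "wait", "low"): 2.0,
        ("high", "wait", "high"): 2.0,
        ("high", "recharge", "high"): 0.0,
    }

    # Sanity check: P(.|s, a) must sum to 1 for every state-action pair.
    for s in P:
        for a in P[s]:
            assert abs(sum(P[s][a].values()) - 1.0) < 1e-9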

Policies

A policy, denoted as π, is a rule that the decision-maker follows in choosing its actions. A policy maps a state to the probabilities of selecting each possible action from that state. The policy space is the set of all policies.
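
In code, a stochastic policy can be represented directly as a table of action probabilities per state. The sketch below, again in Python and reusing the illustrative state and action names from the example above, is one possible representation; the probability values are arbitrary.

    import random

    # pi[s] maps each action a to pi(a|s), the probability of choosing a in state s.
    pi = {
        "low":  {"wait": 0.2, "recharge": 0.8},
        "high": {"wait": 0.9, "recharge": 0.1},
    }

    def sample_action(pi, s):
        """Draw an action in state s according to the policy's probabilities."""
        actions = list(pi[s])
        weights = [pi[s][a] for a in actions]
        return random.choices(actions, weights=weights, k=1)[0]

A deterministic policy is the special case in which, for each state, all of the probability mass is placed on a single action.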

Value Functions

The value function V(s) of a state s under a policy π is the expected return from state s when actions are chosen according to π. The action-value function Q(s, a) under a policy π is the expected return after taking action a in state s and thereafter following π.
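
In the common discounted, infinite-horizon setting, which adds a discount factor γ in [0, 1) that is not part of the 4-tuple above (many formulations include it as a fifth element), the return is the discounted sum of future rewards, and the two value functions can be written as:

    V_π(s)    = E_π[ R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | S_t = s ]
    Q_π(s, a) = E_π[ R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | S_t = s, A_t = a ]

Here R_{t+1} = R(S_t, A_t, S_{t+1}) is the reward received after the t-th transition, and the expectation is taken over trajectories generated by following π.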

Bellman Equations

The Bellman equations provide recursive definitions for the value functions. They are central to the theory of MDPs and are used in algorithms for computing optimal policies.
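
In the notation used above, and again assuming a discount factor γ, the Bellman expectation equation for a fixed policy π and the Bellman optimality equation for the optimal value function V* are:

    V_π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V_π(s') ]
    V*(s)  = max_a Σ_{s'} P(s'|s, a) [ R(s, a, s') + γ V*(s') ]

The first expresses the value of a state as the expected immediate reward plus the discounted value of the successor state; the second replaces the expectation over actions with a maximization, and any policy that acts greedily with respect to V* is optimal.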

Solving MDPs

MDPs are typically solved using dynamic programming techniques such as value iteration or policy iteration. Other methods include linear programming and reinforcement learning algorithms such as Q-learning.
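
As one concrete illustration, the following is a small, self-contained Python sketch of value iteration. The nested-dictionary layout of P and R matches the illustrative example earlier in the article; the discount factor and convergence tolerance are arbitrary choices, and the function is a teaching sketch rather than a production implementation.

    def value_iteration(S, A, P, R, gamma=0.9, tol=1e-6):
        """Compute an approximately optimal value function and a greedy policy.

        S, A are lists of states and actions; P[s][a] maps successor states s'
        to P(s'|s, a); R[(s, a, s')] is the immediate reward; gamma is the
        discount factor, which the 4-tuple above leaves implicit.
        """
        V = {s: 0.0 for s in S}
        while True:
            delta = 0.0
            for s in S:
                # Bellman optimality backup: best one-step lookahead value.
                best = max(
                    sum(p * (R[(s, a, s2)] + gamma * V[s2])
                        for s2, p in P[s][a].items())
                    for a in A if a in P[s]
                )
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:
                break
        # Extract a deterministic policy that is greedy with respect to V.
        policy = {
            s: max(
                (a for a in A if a in P[s]),
                key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                  for s2, p in P[s][a].items()),
            )
            for s in S
        }
        return V, policy

With the example MDP sketched earlier, value_iteration(S, A, P, R) returns the converged state values together with a greedy policy.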

Applications

MDPs find applications in various fields such as economics, game theory, control theory, and artificial intelligence. They are particularly useful for optimization problems in which a sequence of decisions must be made under uncertainty.

See Also

- Reinforcement Learning
- Dynamic Programming
- Game Theory
- Control Theory