Reinforcement learning is a class of methods for solving various kinds of sequential decision making tasks. In such tasks, we want to design an agent that interacts with an external environment. The agent maintains an internal state , which it passes to its policy to choose an action . The environment responds by sending back an observation , which the agent uses to update its internal state using the state-update function .
To simplify things, we often assume that the environment is also a Markovian process, which has internal world state , from which the observations are derived (See POMDP below). We often simplify things even more by assuming that the observation reveals the hidden environment state; in this case, we denote the internal agent state and external environment state by the same letter, namely (also called an MPD, see below).
Maximum Expected Utility Principle
The goal of the agent is to choose a policy so as to maximize the sum of expected rewards
where is the agent’s initial state, is the reward function that the agent uses to measure the value of performing an action in a given state, and is the value function for policy evaluated at .
Definition (Optimal Policy)
We define the optimal policy as
Episodic vs Continual Tasks
If the agent can potentially interact with the environment forever, we call it a continual task. In this case, we replace the sum of rewards (when defining the value function) with the average reward.
Alternatively, we say the agent is in an episodic task if its interaction terminates one the system enters a terminal state or absorbing state, which is a state which transitions to itself with 0 reward. After entering a terminal state, we may start a new episode from a new initial world state .
Reward
Definition (Return)
We define the return for a state at time to the the sum of expected rewards obtained going forwards, where each reward is multiplied by a discount factor
where is the reward, and is the reward-to-go. For episodic tasks that terminate at time , we define for .
The return satisfies the following recursive relationship
The discount factor plays two roles:
It ensures the return is finite, even if , provided we use and the rewards are bounded.
It puts more weight on short-term rewards, which generally has the effect of encouraging the agent to achieve its goals more quickly.
Value Function
Definition (Value Function)
Furthermore, we define the value function to be the expected reward-to-go
Universal Model
A generic representation for sequential decision making problems assumes that the environment can be modeled by a controlled Markov process with hidden state , which gets updated at each step in response to the agents action . To allow non-deterministic dynamics, we write this as , where is the environment’s state transition function (which is usually not known to the agent), and is random system noise. The agent does not see the world state , but instead sees a potentially noisy and/or partial observation at each step, where is random observation noise.
The agent needs to use these observations to maintain an internal belief state about the world, denoted by . This gets updated using the state update function
In the simplest setting, can just store all the past observations , but such non-parametric models can take a lot of time and space to work with, so we usually consider parametric approximations.
Canonical Models
Partially Observed MDPs (POMDP)
Typically the environment’s dynamics model is represented by a stochastic transition function, rather than a deterministic transition function with noise as input.
Markov Decision Process (MDPs)
A Markov decision process is a special case of a POMDP in which the environment states are observed, so . We usually define an MDP in terms of the state transition matrix induced by the world model:
In lieu of an observation model, we assume the environment (as opposed to the agent) send out a reward signal, sampled from . The expected reward is then given by
Given a stochastic policy , the agent can interact with the environment over many steps. Each step is called a transition, and consists of the tuple , where , , and . Hence, under policy , the probability of generating a trajectory length , , can be written explicitly as
Goal-Conditioned MDPs
A goal-conditioned MDP is one in which the reward is defined as if the goal state is achieved. We can also define a dense reward signal using some state abstraction function , by defining , where is some kind of similarity metric.
Sources
Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 1.