RL for LLMs

This note describes how to use RL to improve the performance of LLMs.

RL Fine Tuning (RLFT)

LLMs are usually trained with behavior cloning, i.e., MLE on a fixed dataset, such as a large text corpus scraped from the web. This is called pre-training. We can then improve their performance using various post-training methods, which are designed to improve capabilities and alignment with human preferences.

A simple way to perform post-training is to use instruction fine tuning, also called supervised fine-tuning (SFT), in which we collect human demonstrations of (prompt, response) pairs, and fine-tune the model on them. However, it is very difficult to collect sufficient quantities of such data.

An alternative to demonstrating good behaviors is to use RL to train the model using a suitable reward function. This is called reinforcement learning fine-tuning or RLFT. This can be preferable to SFT for several reasons:

  • It is often the case that verification is easier than generation.
  • RL can be used to learn a set of thinking actions, which are created in response to the question before generating the answer. For complex problems (e.g. in math), this tends to work much better than trying to directly learn an input-output mapping. 1 It is possible to use SFT on explicitly provided thinking traces, but it has been found that RL can generalize more reliably. 2
  • Finally, RL opens the path to super-human performance, going beyond whatever supervised examples humans can create. 3

Reward Models

We now look at various kinds of reward functions for RLFT.

RL with Verifiable Rewards (RLVR)

For problems like math and coding, it can be easy to determine if an answer is correct, by checking equality between the generated answer and the true answer (for math), or checking if a set of unit tests pass (for code). This allows us to define a binary reward signal. Using RL with such a reward is called RL with Verifiable Rewards or RLVR.
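As a rough illustration of such a binary reward, the sketch below checks a math answer for numerical equality and runs a candidate function against unit tests. The function names `math_reward` and `code_reward`, and the use of exact rational comparison, are assumptions made for this example rather than part of any particular RLVR system.

```python
from fractions import Fraction


def math_reward(generated: str, reference: str) -> float:
    """Return 1.0 if the two answers are numerically equal, else 0.0."""
    try:
        return float(Fraction(generated.strip()) == Fraction(reference.strip()))
    except (ValueError, ZeroDivisionError):
        # Assumption: fall back to exact string match for non-numeric answers.
        return float(generated.strip() == reference.strip())


def code_reward(candidate_fn, test_cases) -> float:
    """Return 1.0 only if candidate_fn passes every (args, expected) test."""
    for args, expected in test_cases:
        try:
            if candidate_fn(*args) != expected:
                return 0.0
        except Exception:
            return 0.0  # a crash counts as a failed test
    return 1.0
```

In a real system the generated code would be executed in a sandbox, and the answer would first be extracted from the model's free-form output; both steps are omitted here.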

Process vs Outcome Reward Models

If the reward function is defined on partial trajectories, it is called a process reward model or PRM. This provides a form of dense feedback. If the reward is just defined on the final sequence, it is called an outcome reward model or ORM, and corresponds to a sparse reward.

Learning the Reward Model from Human Feedback (RLHF)

To train LLMs to do well in general tasks, such as text summarization or poetry writing, it is common to use reinforcement learning from human feedback or RLHF, which refers to learning a reward model from human data, and then using RL to train the LLM to maximize this.

The basic idea is as follows:

  • We collect a large number of (context, answer1, answer2) tuples, where the answers are written either by a human or by an LLM.
  • We ask human raters if they prefer answer1 or answer2.
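  • Let $x$ be the prompt (context), $y_w$ be the winning (preferred) output, and $y_l$ be the losing output. Let $r_\phi(x, y)$ be the reward assigned to output $y$, where $\phi$ denotes the reward model parameters. This model is typically a shallow MLP on top of the last layer of a pretrained LLM.
  • We train the reward model by maximizing the likelihood of the observed preference data. The likelihood function is given by the Bradley-Terry choice model.

Under the Bradley-Terry model, this gives the probability

$$p_\phi(y_w \succ y_l \mid x) = \frac{\exp(r_\phi(x, y_w))}{\exp(r_\phi(x, y_w)) + \exp(r_\phi(x, y_l))} = \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)$$

where $\sigma$ is the logistic sigmoid function, and the maximum likelihood objective is therefore

$$\max_\phi \; \mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) \right]$$

where $\mathcal{D}$ is the preference dataset. Equivalently, we can minimize the negative log-likelihood

$$\mathcal{L}(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big) \right]$$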
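The following is a minimal sketch of this loss, assuming the reward model has already produced scalar rewards for the winning and losing responses of each pair. The function names are hypothetical; in practice the rewards would come from a neural network and the loss would be minimized with automatic differentiation.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def preference_prob(r_w: float, r_l: float) -> float:
    """Bradley-Terry probability that the winner is preferred to the loser."""
    return math.exp(log_sigmoid(r_w - r_l))


def bt_nll(rewards_w, rewards_l) -> float:
    """Mean negative log-likelihood over (winner, loser) reward pairs."""
    losses = [-log_sigmoid(rw - rl) for rw, rl in zip(rewards_w, rewards_l)]
    return sum(losses) / len(losses)
```

Note that the loss depends only on the difference between the two rewards, so the reward model is identified only up to an additive offset per prompt.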

Learning the Reward Model from AI Feedback (RLAIF)

Instead of asking humans their preferences for each possible input example, we can ask an LLM to predict the preference. This is often called LLM-as-a-judge. We can then fit the reward model to this synthetically labeled data, just as in RLHF. More broadly, using AI-generated preference labels or reward signals in place of human feedback is called reinforcement learning from AI feedback (RLAIF). It is common to use vision-language models (VLMs) for RLAIF.

Agents Which Think

In this section, we discuss how to leverage the power of LLMs to create agents that think before they act.

Chain of Thought Prompting

The quality of the output from an LLM can be improved by prompting it to show its work before presenting the final answer. These intermediate tokens are called a Chain of Thought. 4 Various theoretical results show that CoT can significantly improve the expressive power of transformers. 5

Training a Thinking Model using RL

Rather than relying on prompting, we can explicitly train a model to think by letting it generate a variable number of tokens in its head before generating the final answer. Only the final outcome is evaluated, using a known reward function (as in the case of math and coding problems).

This approach was recently demonstrated by the DeepSeek-R1-Zero system. 6 They started with a strong LLM base model (known as DeepSeek-V3-Base), which was pretrained on a large variety of data (including chains of thought). They then used a variant of PPO, known as GRPO, to do RLFT using a set of math and coding benchmarks where the ground truth answer is known.

Can We Bootstrap a Model to Think From Scratch?

One reason DeepSeek-R1 got so much attention in the press is that during the training process, it seemed to spontaneously exhibit some emergent abilities such as generating increasingly long sequences of thoughts, and using self-reflection to refine its thinking, before generating the final answer.

The claim that RL has caused these emergent abilities has been disputed by many authors. 7 8 Instead, the general consensus is that the base model itself was already trained on datasets that contained some CoT-style reasoning patterns. This is consistent with findings showing that applying RL to a base model that had not been pretrained on reasoning patterns (such as self-reflection) did not result in a final model that could exhibit such behaviors. 9

Tip

RL can expose or amplify such abilities in a base model if they are already present to a certain extent. The general consensus is that applying RL to a base model that had not been pretrained on reasoning cannot cause these abilities to emerge.

Agentic AI

There is currently a lot of hype around Agentic AI systems, which consist of a set of interacting LLMs, often called agents. These agents are essentially different prompts, reflecting different roles or personas, which are given to a shared LLM to make it act in different ways. Typically these prompts, and the way the different agents interact, are hand-designed. This is called workflow or scaffolding.

Algorithms for Single-Turn RL

This section discusses RL methods for training LLMs to solve math and reasoning problems. In this setup, there is just a single state, namely the input prompt $x$, the action $y$ is a sequence of tokens generated by the policy in response, and then the game ends. This is equivalent to a contextual bandit problem, with sequence-valued input (context) and output (action).

Problem Setup

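Formally, the goal is to maximize

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \, \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \left[ R(x, y) \right]$$

where $R(x, y)$ is the reward, $x$ is the context/prompt (sampled from the data), and $y = (a_1, \ldots, a_T)$ is the generated sequence of actions (tokens) sampled from the policy

$$\pi_\theta(y \mid x) = \prod_{t=1}^{T} \pi_\theta(a_t \mid x, a_{1:t-1})$$

Here, $T$ is the length of the generated output (which is terminated by generating an <eos> token).

We can convert this to an MDP by defining the following deterministic state transitions

$$p(s_{t+1} \mid s_t, a_t) = \delta\big(s_{t+1} = \text{concat}(s_t, a_t)\big)$$

with initial distribution $\delta(s_1 = x)$. Thus the state $s_t = (x, a_{1:t-1})$ is just the sequence of tokens from the initial prompt plus the generated tokens up until time $t$. This definition of the state restores the Markov property, and allows us to write the policy in the usual way as $\pi_\theta(a_t \mid s_t)$.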
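The token-level MDP can be sketched as follows: the state is a tuple of tokens, the transition simply appends the chosen token, and an episode ends at <eos>. The `toy_policy` function, the tiny vocabulary, and the `max_len` cap are hypothetical stand-ins for a real LLM and its decoding settings.

```python
import random

EOS = "<eos>"


def transition(state: tuple, action: str) -> tuple:
    """Deterministic transition: append the chosen token to the state."""
    return state + (action,)


def rollout(prompt_tokens, policy, max_len=20):
    """Sample a response token by token; the state is prompt + tokens so far."""
    state = tuple(prompt_tokens)  # the initial state is the prompt itself
    actions = []
    for _ in range(max_len):
        action = policy(state)  # the policy sees only the current state
        actions.append(action)
        state = transition(state, action)
        if action == EOS:
            break
    return state, actions


def toy_policy(state):
    """Hypothetical stand-in for an LLM: a random token from a tiny vocabulary."""
    return random.choice(["a", "b", EOS])
```

Because the state contains the full history, the policy needs no memory beyond its input, which is the sense in which the Markov property is restored.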

Tip

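In practice, the above approach can overfit to the reward function, so we usually regularize the problem to ensure the policy remains close to the base pretrained LLM $\pi_{\text{ref}}$. We can do this by adding a penalty, such as $-\beta \log \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\text{ref}}(a_t \mid s_t)}$ for token $a_t$ in state $s_t$ with coefficient $\beta > 0$, to the per-token reward.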
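One common way to implement this regularization is sketched below, under the assumption that the penalty is the negative scaled log-ratio between the policy and the reference model at each token, and that the task reward arrives only at the final token. The function name and the default `beta=0.1` are illustrative choices, not values from the source.

```python
def shaped_rewards(logp_policy, logp_ref, final_reward, beta=0.1):
    """Per-token rewards with a penalty for drifting from the reference model.

    logp_policy, logp_ref: log-probabilities of the sampled tokens under the
    current policy and the reference (base) model, one entry per token.
    """
    # Penalty at every token: -beta * log(pi_theta / pi_ref).
    rewards = [-beta * (lp - lr) for lp, lr in zip(logp_policy, logp_ref)]
    # Assumption: the (sparse) task reward is added at the last token only.
    rewards[-1] += final_reward
    return rewards
```

When the policy agrees with the reference model on every sampled token, the penalty vanishes and only the task reward remains.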

PPO

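A natural approach to training the LLM policy is to use PPO. In the bandit case, we can write the objective as follows

$$J_{\text{PPO}}(\theta) = \mathbb{E}_{x \sim \mathcal{D}} \, \mathbb{E}_{y \sim \pi_{\text{old}}(\cdot \mid x)} \left[ \min\big( \rho_\theta(y \mid x) A(x, y), \; \text{clip}(\rho_\theta(y \mid x), 1 - \epsilon, 1 + \epsilon) A(x, y) \big) \right]$$

where

$$\rho_\theta(y \mid x) = \frac{\pi_\theta(y \mid x)}{\pi_{\text{old}}(y \mid x)}, \qquad A(x, y) = R(x, y) - V_\psi(x)$$

Here $x$ is the prompt, $y$ is the response, $\rho_\theta$ is the likelihood ratio between the current policy $\pi_\theta$ and the policy $\pi_{\text{old}}$ that generated the samples, $\epsilon$ is the clipping range, and the advantage $A(x, y)$ is the reward $R(x, y)$ minus a learned value baseline $V_\psi(x)$ (the critic).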
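For a single (prompt, response) pair, the clipped PPO surrogate can be sketched as below. It takes the sequence log-probabilities under the new and old policies and a precomputed advantage; the function name and the default clipping range `eps=0.2` are assumptions for illustration.

```python
import math


def ppo_clip_term(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate objective for one (prompt, response) pair."""
    ratio = math.exp(logp_new - logp_old)  # likelihood ratio
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Taking the min gives a pessimistic bound on the improvement.
    return min(ratio * advantage, clipped * advantage)
```

With a positive advantage, the term stops growing once the ratio exceeds 1 + eps, so there is no incentive to move the policy further. With a negative advantage and a large ratio, the unclipped (more negative) value is kept, so the update can still correct the policy.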

GRPO

Learning an actor (policy) and value function (critic) takes roughly twice as much time and memory as just learning a policy. This is problematic for large language models. Therefore, several recent methods replace the parametric value function with Monte Carlo (MC) estimates.

Caution

The disadvantage here is that estimating the value with MC rollouts requires that we can reset the environment, so that we can generate multiple responses (action trajectories) given the same initial state. This is fine for question answering, but much harder for multi-turn RL, which is discussed below. Also, there is no credit assignment to intermediate states, since we are only estimating the value of the initial state. Thus the method is statistically quite inefficient.

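The Group Relative Policy Optimization or GRPO algorithm 10 is a variant of PPO which replaces the critic network with a Monte Carlo estimate of the value function. For each prompt $x$, we generate $G$ answers $y_1, \ldots, y_G$ (called a group) which give final rewards $R_1, \ldots, R_G$. We then compute the advantage by subtracting the group average and dividing by the group standard deviation:

$$A_i = \frac{R_i - \mu_R}{\sigma_R}$$

where $\mu_R = \frac{1}{G} \sum_{i=1}^{G} R_i$ and $\sigma_R = \sqrt{\frac{1}{G} \sum_{i=1}^{G} (R_i - \mu_R)^2}$. The use of the normalization term ensures the rewards are calibrated, so that a hard problem with low average reward may still result in an update if the deviation from this low mean is large.

Since the policy generates a sequence, we can expand out the loss for each sequence into a sum of per-token losses. In GRPO, we set the step-level advantage to be equal to the normalized trajectory-level advantage, $A_{i,t} = A_i$, and we define the likelihood ratio as

$$\rho_{i,t}(\theta) = \frac{\pi_\theta(a_{i,t} \mid s_{i,t})}{\pi_{\text{old}}(a_{i,t} \mid s_{i,t})}$$

where $a_{i,t}$ is the $t$-th token of answer $y_i$ and $s_{i,t}$ is the prompt plus the preceding tokens of $y_i$. We then normalize the clipped advantage by the length $T_i$ of each sequence, to ensure that long sequences don’t dominate the loss

$$J_{\text{GRPO}}(\theta) = \mathbb{E}_{x, \, y_{1:G} \sim \pi_{\text{old}}} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{T_i} \sum_{t=1}^{T_i} \min\big( \rho_{i,t}(\theta) A_{i,t}, \; \text{clip}(\rho_{i,t}(\theta), 1 - \epsilon, 1 + \epsilon) A_{i,t} \big) \right]$$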
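The group-relative advantage and the length-normalized clipped objective can be sketched as follows for one prompt. The inputs are per-token log-probabilities of each sampled answer under the new and old policies. The function names, the small constant `eps` added to the standard deviation for numerical safety, and the default `clip_eps=0.2` are assumptions of this sketch.

```python
import math


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within a group: (R_i - mean) / (std + eps)."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / g)
    return [(r - mean) / (std + eps) for r in rewards]


def grpo_objective(logp_new, logp_old, rewards, clip_eps=0.2):
    """logp_new / logp_old: one list of per-token log-probs per answer."""
    advantages = group_advantages(rewards)
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        seq = 0.0
        for a, b in zip(lp_new, lp_old):
            ratio = math.exp(a - b)  # per-token likelihood ratio
            clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
            # Every token in the answer shares the same advantage.
            seq += min(ratio * adv, clipped * adv)
        total += seq / len(lp_new)  # normalize by the answer length
    return total / len(rewards)  # average over the group
```

If every answer in a group receives the same reward, all advantages are zero and the prompt contributes no gradient signal.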

DAPO

In the DAPO paper 11, they suggest an asymmetric clipping of the per-token likelihood ratio term $\rho_{i,t}(\theta)$:

$$\text{clip}\big(\rho_{i,t}(\theta), \; 1 - \epsilon_{\text{low}}, \; 1 + \epsilon_{\text{high}}\big)$$

In particular, they suggest $\epsilon_{\text{high}} > \epsilon_{\text{low}}$ in the clipping term, so that actions which are low-probability under the previous model, and which therefore get large likelihood ratio $\rho_{i,t}$, are not clipped as much, since clipping them would suppress exploration and result in entropy collapse. On the other hand, the smaller lower clipping value makes it harder for the policy to reduce action probabilities too aggressively in a single update, which helps prevent the policy from prematurely eliminating diverse behaviors.
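A minimal sketch of the asymmetric clipping for a single token is shown below. The function name is hypothetical, and the default values of `eps_low` and `eps_high` are illustrative, chosen only so that the upper range is wider than the lower one.

```python
def dapo_token_term(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Clipped surrogate for one token with separate lower and upper ranges."""
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped * advantage)
```

For example, a token with ratio 1.25 and positive advantage is left unclipped here, whereas a symmetric range of 0.2 (passing `eps_high=0.2`) would cap it at 1.2.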

GSPO

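In the GSPO (Group Sequence Policy Optimization) paper 12, they point out a flaw with GRPO, due to the fact that the importance sampling correction is applied to each token, even though the reward is evaluated at the trajectory level. This can result in unstable training. They therefore propose to use the following sequence-level objective, following the contextual bandit formulation:

$$J_{\text{GSPO}}(\theta) = \mathbb{E}_{x, \, y_{1:G} \sim \pi_{\text{old}}} \left[ \frac{1}{G} \sum_{i=1}^{G} \min\big( \rho_i(\theta) A_i, \; \text{clip}(\rho_i(\theta), 1 - \epsilon, 1 + \epsilon) A_i \big) \right]$$

where $A_i$ is the group-normalized advantage of response $y_i$ to prompt $x$, and the importance ratio is given by the ratio of sequence-level likelihoods. They also normalize by the response length $T_i$ to ensure the magnitude (and hence clipping value) is comparable across sequences:

$$\rho_i(\theta) = \left( \frac{\pi_\theta(y_i \mid x)}{\pi_{\text{old}}(y_i \mid x)} \right)^{1 / T_i}$$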
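The length-normalized sequence ratio is the geometric mean of the per-token ratios, which is easiest to compute in log space. The sketch below assumes per-token log-probabilities and precomputed advantages as inputs; the function names and the default `clip_eps=0.2` are placeholders, not values taken from the paper.

```python
import math


def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence-level importance ratio."""
    log_ratio = sum(a - b for a, b in zip(logp_new, logp_old))
    return math.exp(log_ratio / len(logp_new))


def gspo_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clip once per response rather than once per token."""
    total = 0.0
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = sequence_ratio(lp_new, lp_old)
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        total += min(ratio * adv, clipped * adv)
    return total / len(advantages)
```

Because the ratio is computed per response, a single outlier token cannot trigger clipping on its own; only the average log-ratio over the whole response matters.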

Dr GRPO

It turns out that the division by the standard deviation used by GRPO when normalizing the advantage terms induces a difficulty bias, in which a very easy or hard prompt may have a low group-level standard deviation of the corresponding rewards, and dividing by a small standard deviation can result in unstable gradients. The problem can be solved using the Dr GRPO (GRPO Done Right) method 7, where they just drop the denominator, giving

$$A_i = R_i - \frac{1}{G} \sum_{j=1}^{G} R_j$$

where $R_i$ is the reward of the $i$-th response in a group of size $G$.
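The mean-centered advantage without the standard-deviation denominator can be sketched in a few lines. The function name is hypothetical, and the sketch covers only the advantage computation described above.

```python
def dr_grpo_advantages(rewards):
    """Center rewards on the group mean, without dividing by the std."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]
```

For an easy prompt where three of four responses are correct, `dr_grpo_advantages([1.0, 1.0, 1.0, 0.0])` keeps the advantages on the scale of the raw rewards, rather than inflating them by the inverse of a small standard deviation.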

Algorithms for Multi-Turn RL

The previous section considered the bandit setting, in which a single prompt is presented, and then the agent optionally generates some thinking tokens, followed by the answer tokens, and then immediately gets a reward. Another class of methods trains agents that can interact with an external environment, which enables tool use and dialog. The difference from the standard contextual bandit LLM reasoning setting is that the effect of an action on the external environment is typically unknown, may be stochastic, and the reward may be delayed. In addition, the external environment is often stateful, so actions may be irreversible. Training agents in this setting requires true multi-turn RL methods.

Sources

  • Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 6.

Footnotes

  1. Why think step by step? Reasoning emerges from the locality of experience

  2. SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

  3. Welcome to the Era of Experience

  4. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

  5. The Expressive Power of Transformers with Chain of Thought

  6. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

  7. Understanding R1-Zero-Like Training: A Critical Perspective

  8. Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?

  9. Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

  10. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

  11. DAPO: An Open-Source LLM Reinforcement Learning System at Scale

  12. Group Sequence Policy Optimization