LLMs for RL

LLMs can be used to help create agents that themselves may or may not use language. The LLMs can be used for their prior knowledge, their ability to generate code, their reasoning ability, and their ability to perform in-context learning.

LLMs for Pre-Processing the Input

If the input observations $o_t$ sent to the agent are in natural language (or some other textual representation such as JSON), it is natural to use an LLM to process them, in order to compute a more compact representation $s_t = \phi(o_t)$, where $\phi$ can be the hidden state of the last layer of an LLM.
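The interface described above can be sketched as follows. This is a minimal illustration, not a real LLM pipeline: `phi` is a toy hashed bag-of-words encoder standing in for the last-layer hidden state of an LLM, and `linear_policy`, `DIM`, and the three-action setup are assumptions introduced only for this example.

```python
import hashlib
import json

import numpy as np

DIM = 16  # size of the compact representation (arbitrary choice for this sketch)


def phi(observation: str, dim: int = DIM) -> np.ndarray:
    """Toy stand-in for an LLM encoder: hashed bag-of-words, L2-normalised.

    A real system would return the last-layer hidden state of an LLM here.
    """
    vec = np.zeros(dim)
    for token in observation.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vec[int.from_bytes(digest[:4], "little") % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def linear_policy(state: np.ndarray, weights: np.ndarray) -> int:
    """Pick the discrete action with the highest linear score."""
    return int(np.argmax(weights @ state))


# A textual (JSON) observation o_t is mapped to s_t = phi(o_t), then to an action.
obs = json.dumps({"room": "kitchen", "inventory": ["key"], "goal": "open door"})
s_t = phi(obs)
rng = np.random.default_rng(0)
W = rng.normal(size=(3, DIM))  # weights for 3 hypothetical discrete actions
action = linear_policy(s_t, W)
```

The policy only ever sees the fixed-size vector `s_t`, so the encoder can be swapped for a real LLM without changing the policy code.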

Example: AlphaProof

The AlphaProof system [1] uses an LLM (called the formalizer network) to translate an informal specification of a math problem into a formal Lean representation. This is then passed to an agent (called the solver network), trained using the AlphaZero method, which generates proofs inside the Lean theorem-proving environment. In this environment, the reward is 0 or 1 (the proof is incorrect or correct), the state space is a structured set of previously proven facts together with the current goal, and the action space is a set of proof tactics. The agent itself is a separate transformer policy network.
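To make the state, action, and reward concrete, here is a toy Lean 4 example. It is not taken from AlphaProof; it only illustrates the shape of the formal side: an informal statement, its formal counterpart (the goal), and a proof built from tactics (the actions). The theorem name is hypothetical.

```lean
-- Informal statement: "for all natural numbers a and b, a + b = b + a".
-- The theorem statement is the goal; each tactic in the `by` block is an action.
theorem add_comm_example (a b : Nat) : a + b = b + a := by
  exact Nat.add_comm a b
```

If Lean accepts the proof, the episode would receive reward 1; otherwise it would receive reward 0.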

VLMs for Parsing Images into Structured Data

If the observations are images, it is traditional to use a CNN to process the input, so $s_t = \phi(o_t)$ would be an embedding vector. However, we could alternatively use a VLM to compute a structured representation, where $s_t$ might be a set of tokens describing the scene at a high level, or potentially a JSON dictionary. We can then pass this symbolic representation to a policy function.
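A minimal sketch of the second option, assuming the VLM has already returned a JSON description of the scene. The `vlm_output` string, its schema, and the rule-based `symbolic_policy` are all hypothetical and exist only to show a policy consuming a symbolic state.

```python
import json

# Hypothetical VLM output for one frame; a real VLM call would produce this string.
vlm_output = (
    '{"agent": [2, 3], "holding": null, "objects": ['
    '{"name": "key", "pos": [2, 5]}, {"name": "door", "pos": [7, 3]}]}'
)


def symbolic_policy(scene: dict) -> str:
    """Rule-based policy over the structured scene: fetch the key, then go to the door."""
    target_name = "door" if scene["holding"] == "key" else "key"
    target = next(o for o in scene["objects"] if o["name"] == target_name)
    ax, ay = scene["agent"]
    tx, ty = target["pos"]
    if ax != tx:
        return "right" if tx > ax else "left"
    if ay != ty:
        return "up" if ty > ay else "down"
    return "interact"


scene = json.loads(vlm_output)
action = symbolic_policy(scene)
```

Because the state is symbolic, the policy can be written, inspected, and debugged directly, which is harder with an opaque embedding vector.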

It is also common to use a VLM or LLM to define a reward function.

One example is the Motif system [2]: a pretrained policy is used to collect trajectories, from which pairs of states $(o, o')$ are selected at random. The LLM is then asked which state is preferable, generating training tuples $(o, o', y)$, where $y$ is the preference label. These tuples are used to train a binary classifier, from which a reward model is extracted.
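The preference-based recipe used by Motif can be sketched on toy data. This is an illustration under strong assumptions, not Motif's implementation: states are 2-D feature vectors, `annotator_prefers_second` is a stand-in for the LLM preference query, and the classifier is a linear Bradley-Terry style model trained by plain gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy states: 2-D feature vectors. The hidden "annotator" prefers a larger first feature.
states = rng.normal(size=(200, 2))


def annotator_prefers_second(o: np.ndarray, o_prime: np.ndarray) -> int:
    """Stand-in for the LLM preference query: returns y = 1 if o' is preferred."""
    return int(o_prime[0] > o[0])


# Build (o, o', y) tuples from randomly selected pairs of states.
idx = rng.integers(0, len(states), size=(500, 2))
pairs = [(states[i], states[j]) for i, j in idx if i != j]
O = np.array([p[0] for p in pairs])
O_prime = np.array([p[1] for p in pairs])
y = np.array([annotator_prefers_second(o, op) for o, op in pairs])

# Binary classifier: P(o' preferred) = sigmoid(r(o') - r(o)), with r(o) = w @ o.
w = np.zeros(2)
for _ in range(500):
    logits = (O_prime - O) @ w
    p = 1.0 / (1.0 + np.exp(-logits))
    grad = (O_prime - O).T @ (p - y) / len(y)  # gradient of the mean cross-entropy
    w -= 0.5 * grad


def reward(o: np.ndarray) -> float:
    """Reward model extracted from the trained preference classifier."""
    return float(w @ o)
```

The classifier is only ever trained on comparisons, but its internal score `r(o)` can be evaluated on a single state, which is what lets it serve as a reward function.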

LLMs for Rewards

It is difficult to design a reward function that causes an agent to exhibit some desired behavior. Fortunately, LLMs can often help with this task, especially in goal-conditioned RL.

LLMs for World Models

LLMs can also be used to create world models of the form $p(s_{t+1} \mid s_t, a_t)$, which we denote by $p(s' \mid s, a)$ for brevity. We can do this by treating the LLM itself as the world model (which is then updated using in-context learning), or by asking the LLM to generate another artefact, such as some Python code, that represents the world model. The advantage of the latter approach is that the resulting world model will be much faster to run, and may be more interpretable.

LLMs as World Models

In principle, it is possible to treat a pre-trained LLM (or other kind of foundation model) as an implicit model of the form $p(s' \mid s, a)$ by sampling responses to a suitable prompt, which encodes $s$ and $a$.

Caution

This rarely works out of the box. However, it can be made to work by suitable pre-training.
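The prompting scheme can be sketched as follows. The prompt wording, the JSON state format, and `fake_llm` are all assumptions for illustration; `fake_llm` is a hard-coded stand-in so the example runs without a model, and a real LLM would need suitable training to behave this reliably.

```python
import json
from typing import Callable


def build_prompt(state: dict, action: str) -> str:
    """Encode the current state s and action a as text."""
    return (
        "You are simulating a grid world.\n"
        f"Current state: {json.dumps(state, sort_keys=True)}\n"
        f"Action: {action}\n"
        "Reply with the next state as JSON."
    )


def sample_next_state(llm: Callable[[str], str], state: dict, action: str) -> dict:
    """Draw s' by querying the LLM and parsing its reply."""
    reply = llm(build_prompt(state, action))
    return json.loads(reply)


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM call; hard-codes one dynamics rule."""
    state = json.loads(prompt.split("Current state: ")[1].split("\n")[0])
    if "Action: right" in prompt:
        state["x"] += 1
    return json.dumps(state)


next_state = sample_next_state(fake_llm, {"x": 0, "y": 2}, "right")
```

Every simulated step costs one model call, which is the main practical drawback of using the LLM itself as the world model.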

LLMs for Generating Code World Models

Calling the LLM at every step to sample from the world model is very slow, so an alternative is to use the LLM to generate code that represents the world model. This is called a code world model (CWM).

One approach is to rely on zero-shot prompting of the LLM to generate the CWM just from a text description of the environment, possibly combined with feedback that checks the validity of the generated model.
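A minimal sketch of the validity check, assuming the feedback consists of replaying observed transitions through each candidate. The `candidates` strings stand in for LLM outputs, and the 1-D corridor environment with walls at 0 and 4 is invented for the example.

```python
# Transitions observed in the real environment: (state, action, next_state).
# The toy environment is a 1-D corridor with walls at positions 0 and 4.
observed = [(0, +1, 1), (1, +1, 2), (4, +1, 4), (0, -1, 0)]

# Hypothetical LLM outputs: candidate code world models as source strings.
candidates = [
    "def step(s, a):\n    return s + a\n",
    "def step(s, a):\n    return min(max(s + a, 0), 4)\n",
]


def validate(source: str) -> list:
    """Run a candidate CWM on the observed transitions and return its failures."""
    namespace = {}
    exec(source, namespace)  # only safe for trusted or sandboxed code
    step = namespace["step"]
    return [(s, a, s2, step(s, a)) for s, a, s2 in observed if step(s, a) != s2]


accepted = None
feedback = []
for source in candidates:
    failures = validate(source)
    if not failures:
        accepted = source
        break
    # In a real loop, these failures would be added to the next prompt.
    feedback.append(failures)
```

Each failure records the expected and predicted next state, which is the kind of concrete signal that can be fed back to the LLM when asking for a revised model.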

LLMs for Policies

LLMs can be used for creating policies. We can do this by treating the LLM itself as a policy (which is then updated using in-context learning), or asking the LLM to generate some code that represents the policy.

LLMs for Generating Actions

We can sample an action $a_t$ from a policy by using an LLM, where the input context contains the past data, such as $(o_1, a_1, \ldots, a_{t-1}, o_t)$, and the output token is interpreted as the action.

Caution

For this to work, the model must be pretrained on state-action sequences using behavior cloning.
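The loop of serialising the history, querying the model, and interpreting the output token as an action can be sketched as follows. The prompt format, the `ACTIONS` list, and `fake_llm` are hypothetical; the fallback to a default action is one possible way to handle outputs that are not valid actions.

```python
from typing import Callable, List, Tuple

ACTIONS = ["left", "right", "stay"]  # hypothetical discrete action set


def format_history(history: List[Tuple[str, str]], observation: str) -> str:
    """Serialise past (observation, action) pairs plus the current observation."""
    lines = [f"obs: {o}\nact: {a}" for o, a in history]
    lines.append(f"obs: {observation}\nact:")
    return "\n".join(lines)


def act(llm: Callable[[str], str], history: List[Tuple[str, str]], observation: str) -> str:
    """Query the LLM and interpret its output token as an action."""
    token = llm(format_history(history, observation)).strip()
    # Fall back to a default action if the output is not a valid action.
    return token if token in ACTIONS else "stay"


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for a behavior-cloned model."""
    last_obs = prompt.rsplit("obs: ", 1)[1]
    return " right" if "goal is east" in last_obs else " north"


history = [("goal is east, x=0", "right")]
action = act(fake_llm, history, "goal is east, x=1")
fallback = act(fake_llm, history, "goal is west, x=1")
```

Here `action` is a valid action, while `fallback` shows what happens when the model emits a token outside the action set.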

An alternative approach is to enumerate all possible discrete actions, and use the LLM to score them in terms of their likelihoods given the goal, and their suitability given a learned value function applied to the current state, i.e. $\pi(a_t = k \mid g, o_t) \propto p_{\text{LLM}}(w_k \mid g) \, V_k(o_t)$, where $g$ is the current goal, $w_k$ is a text description of action $k$, and $V_k$ is the value function for action $k$.
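The scoring rule amounts to multiplying two per-action numbers and normalising. The sketch below uses made-up values: `llm_logprob` stands in for the LLM's log-likelihood of each action description given the goal, and `value` stands in for the learned value of each action in the current state.

```python
import numpy as np

actions = ["pick up the sponge", "go to the sink", "open the fridge"]

# Hypothetical LLM log-likelihoods of each action description given the goal.
llm_logprob = np.array([-1.0, -2.0, -4.0])

# Hypothetical learned values of each action in the current state.
value = np.array([0.1, 0.9, 0.8])

# Policy is proportional to LLM likelihood times value.
scores = np.exp(llm_logprob) * value
policy = scores / scores.sum()
best = actions[int(np.argmax(policy))]
```

In this toy case the LLM alone would favour picking up the sponge, but the low value of that action in the current state shifts the choice to going to the sink.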

LLMs for Generating Code Policies

Calling the LLM at every step is slow, so an alternative is to use the LLM to generate code that represents (part of) the policy. This is called a code policy. We can also use the LLM as a mutation operator inside an evolutionary search algorithm, as in the FunSearch system [3], where the objective is to maximize the performance of the generated policy when deployed in one or more environments.
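A heavily simplified sketch of searching over code policies, far simpler than FunSearch itself. The `mutate` function is a hypothetical stand-in for the LLM mutation operator (it only perturbs a numeric threshold in the source), and `evaluate` is a toy objective invented for the example.

```python
import random

rng = random.Random(0)

# Source template for a one-line code policy with a tunable threshold.
TEMPLATE = "def policy(obs):\n    return 1 if obs > {threshold:.2f} else 0\n"


def compile_policy(source: str):
    """Turn policy source code into a callable."""
    namespace = {}
    exec(source, namespace)  # only safe for trusted or sandboxed code
    return namespace["policy"]


def evaluate(source: str) -> float:
    """Toy objective: fraction of observations where the policy matches obs > 0.3."""
    policy = compile_policy(source)
    grid = [i / 50 - 1 for i in range(101)]  # observations in [-1, 1]
    return sum(policy(o) == int(o > 0.3) for o in grid) / len(grid)


def mutate(source: str) -> str:
    """Hypothetical stand-in for the LLM mutation operator: perturbs the threshold."""
    old = float(source.split("> ")[1].split(" ")[0])
    return TEMPLATE.format(threshold=old + rng.uniform(-0.3, 0.3))


population = [TEMPLATE.format(threshold=-0.8)]
initial_score = evaluate(population[0])
for _ in range(30):
    parent = max(population, key=evaluate)  # select the best program so far
    population.append(mutate(parent))       # ask the "LLM" for a variant
best = max(population, key=evaluate)
best_score = evaluate(best)
```

The search only ever calls the mutation operator during optimisation; the resulting `best` program runs without any model call at deployment time.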

In-Context RL

Large LLMs have shown a surprising property known as in-context learning (ICL), in which they can be taught to do function approximation just by being given $(x, y)$ pairs in their context (prompt). This allows an LLM to be adapted to a task without any gradient updates to the underlying parameters. [4]
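The mechanics of supplying pairs in the context can be sketched with a prompt builder. The `x = ..., y = ...` format and the function name are assumptions for illustration; no model is called here.

```python
from typing import List, Tuple


def build_icl_prompt(examples: List[Tuple[float, float]], query: float) -> str:
    """Place (x, y) pairs in the context, followed by the query x."""
    lines = [f"x = {x}, y = {y}" for x, y in examples]
    lines.append(f"x = {query}, y =")
    return "\n".join(lines)


examples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
prompt = build_icl_prompt(examples, 4.0)

# "Learning" from a new data point only extends the context; no parameters change.
examples.append((4.0, 8.0))
updated_prompt = build_icl_prompt(examples, 5.0)
```

In an RL setting the pairs would be replaced by past interaction data, so that new experience is absorbed by growing the prompt rather than by updating weights.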

Sources

  • Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 6.

Footnotes

  1. AlphaProof

  2. Motif: Intrinsic Motivation from Artificial Intelligence Feedback

  3. FunSearch: Making new discoveries in mathematical sciences using Large Language Models

  4. A Survey of In-Context Reinforcement Learning