LLMs for RL
LLMs can be used to help create agents that themselves may or may not use language. The LLMs can be used for their prior knowledge, their ability to generate code, their reasoning ability, and their ability to perform in-context learning.
LLMs for Pre-Processing the Input
If the input observations
Example: AlphaProof
The AlphaProof system [1] uses an LLM (called the formalizer network) to translate an informal specification of a math problem into a formal Lean representation. This is passed to an agent (called the solver network), trained using the AlphaZero method, which generates proofs inside the Lean theorem-proving environment. In this environment, the reward is 1 or 0 (the proof is correct or not), the state space is a structured set of previously proven facts together with the current goal, and the action space is a set of proof tactics. The agent itself is a separate transformer policy network.
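To give a feel for the state, action, and reward in such an environment, here is a small Lean 4 proof. It is an illustrative sketch only, not taken from AlphaProof; the theorem name `and_swap_example` is made up. The hypotheses and the goal after `:` play the role of the state, each tactic line is an action, and the reward is 1 exactly when Lean accepts the finished proof.

```lean
-- State: hypothesis h : p ∧ q, goal: q ∧ p.
theorem and_swap_example (p q : Prop) (h : p ∧ q) : q ∧ p := by
  -- Action 1: split the conjunction goal into two subgoals, q and p.
  constructor
  -- Action 2: close the first subgoal using the right component of h.
  · exact h.2
  -- Action 3: close the second subgoal using the left component of h.
  · exact h.1
```

After `constructor` the state changes to two open goals, so the agent effectively searches over sequences of tactics until no goals remain.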
VLMs for Parsing Images into Structured Data
If the observations are images, it is traditional to use a CNN to process the input, so
One option is the Motif system [2], in which a pretrained policy is used to collect trajectories, from which pairs of states
It is also common to use a VLM to define a reward function.
LLMs for Rewards
It is difficult to design a reward function that causes an agent to exhibit some desired behavior. Fortunately, LLMs can often help with this task, especially in goal-conditioned RL.
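As a rough illustration of a goal-conditioned reward, the sketch below wraps a "judge" in a sparse reward function. Everything here is hypothetical: `make_goal_reward` and the `Judge` interface are invented for the example, and `keyword_judge` is a toy stand-in for a real LLM call (it just counts overlapping words), so it only shows the plumbing and not how well an LLM would score.

```python
from typing import Callable

# Hypothetical interface: a judge maps (goal, observation caption) to a score in [0, 1].
Judge = Callable[[str, str], float]


def keyword_judge(goal: str, caption: str) -> float:
    """Toy stand-in for an LLM call: fraction of goal words found in the caption."""
    goal_words = set(goal.lower().split())
    caption_words = set(caption.lower().split())
    if not goal_words:
        return 0.0
    return len(goal_words & caption_words) / len(goal_words)


def make_goal_reward(judge: Judge, goal: str, threshold: float = 0.5) -> Callable[[str], float]:
    """Build a sparse reward: 1.0 if the judge's score reaches the threshold, else 0.0."""
    def reward(caption: str) -> float:
        return 1.0 if judge(goal, caption) >= threshold else 0.0
    return reward


reward_fn = make_goal_reward(keyword_judge, goal="open the red door")
print(reward_fn("the agent will open the red door"))  # every goal word appears
print(reward_fn("the agent picks up a key"))          # only "the" overlaps
```

Swapping `keyword_judge` for a function that queries an LLM would leave the rest of the agent unchanged, since the reward function only depends on the `Judge` signature.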
LLMs for World Models
LLMs can also be used to create world models of the form
LLMs as World Models
In principle, it is possible to treat a pre-trained LLM (or other kind of foundation model) as an implicit model of the form
Caution
This rarely works out of the box. However, it can be made to work by suitable pre-training.
LLMs for Generating Code World Models
Calling the LLM at every step to sample from the world model
One approach is to rely on zero-shot prompting of the LLM to generate the code world model (CWM) just from a text description of the environment, possibly combined with feedback that checks the validity of the generated model.
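To make this concrete, the sketch below shows what a generated CWM and a validity check might look like. It rests on several assumptions: `cwm_step` is hand-written here as a stand-in for LLM output, the 1-D corridor environment is invented for the example, and `validate_cwm` is one simple form of feedback (replaying logged transitions), not a prescribed method.

```python
from typing import Callable, List, Tuple

# (state, action, next_state, reward, done), as logged from the real environment.
Transition = Tuple[int, int, int, float, bool]
StepFn = Callable[[int, int], Tuple[int, float, bool]]


def cwm_step(state: int, action: int) -> Tuple[int, float, bool]:
    """Stand-in for LLM-generated code: a 1-D corridor with cells 0..4.

    Action 1 moves right, any other action moves left; reaching cell 4
    gives reward 1.0 and ends the episode.
    """
    delta = 1 if action == 1 else -1
    next_state = min(max(state + delta, 0), 4)  # walls at both ends
    done = next_state == 4
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def validate_cwm(step_fn: StepFn, transitions: List[Transition]) -> List[Transition]:
    """Return the logged transitions that the candidate model fails to reproduce."""
    failures = []
    for s, a, s_next, r, done in transitions:
        if step_fn(s, a) != (s_next, r, done):
            failures.append((s, a, s_next, r, done))
    return failures


logged = [(0, 1, 1, 0.0, False), (3, 1, 4, 1.0, True), (0, 0, 0, 0.0, False)]
failures = validate_cwm(cwm_step, logged)
print(f"{len(failures)} mismatched transitions")
```

In a feedback loop, any mismatched transitions could be appended to the prompt so that the LLM can revise its code.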
LLMs for Policies
LLMs can be used to create policies. We can do this by treating the LLM itself as a policy (which is then updated using in-context learning), or by asking the LLM to generate code that represents the policy.
LLMs for Generating Actions
We can sample an action from a policy
Caution
For this to work, the model must be pretrained on state-action sequences using behavior cloning.
An alternative approach is to enumerate all possible discrete actions, use the LLM to score them in terms of their likelihood given the goal, and combine this with their suitability given a learned value function applied to the current state.
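The scoring scheme just described can be sketched as follows. Multiplying the LLM likelihood by the value estimate is one plausible way to combine the two signals and is an assumption of this example; `select_action` is a hypothetical helper, and the log-probabilities and values are made-up numbers standing in for an LLM and a learned value function.

```python
import math
from typing import Callable, List


def select_action(
    actions: List[str],
    llm_logprob: Callable[[str], float],
    value_fn: Callable[[str], float],
) -> str:
    """Pick the action maximizing exp(llm_logprob(a)) * value_fn(a)."""
    scores = {a: math.exp(llm_logprob(a)) * value_fn(a) for a in actions}
    return max(scores, key=lambda a: scores[a])


# Toy stand-ins (assumed numbers): how likely the LLM finds each action given
# the goal, and how feasible each action is in the current state.
llm_logprobs = {
    "pick up sponge": math.log(0.6),
    "go to table": math.log(0.3),
    "open drawer": math.log(0.1),
}
values = {"pick up sponge": 0.1, "go to table": 0.9, "open drawer": 0.5}

best = select_action(
    list(llm_logprobs),
    llm_logprob=lambda a: llm_logprobs[a],
    value_fn=lambda a: values[a],
)
print(best)
```

Here the LLM prefers "pick up sponge", but the value function rates it as barely feasible in the current state, so the combined score favors a different action.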
LLMs for Generating Code Policies
Calling the LLM at every step is slow, so an alternative is to use the LLM to generate code that represents (part of) the policy. This is called a code policy. We can also use the LLM as a mutation operator inside an evolutionary search algorithm, as in the FunSearch system, where the objective is to maximize the performance of the generated policy when deployed to one or more environments [3].
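The idea of evolving a code policy can be illustrated with a deliberately tiny sketch. It is not FunSearch's actual procedure: it uses a simple (1+1)-style loop, a made-up thermostat task, and `toy_mutate`, which only nudges a numeric threshold and stands in for an LLM that would propose edits to the program text. All names here are hypothetical.

```python
import random
import re
from typing import Callable, Dict

# Seed code policy: heat (action 1) only when the temperature is below 0.
SEED_PROGRAM = "def policy(temp):\n    return 1 if temp < 0 else 0\n"


def compile_policy(source: str) -> Callable[[float], int]:
    """Execute program text in a fresh namespace and return its `policy` function."""
    namespace: Dict[str, object] = {}
    exec(source, namespace)  # fine for this toy; real LLM output should be sandboxed
    return namespace["policy"]


def evaluate(source: str) -> float:
    """Toy task: fraction of temperatures 0..39 where the policy heats iff temp < 20."""
    policy = compile_policy(source)
    correct = sum(policy(t) == (1 if t < 20 else 0) for t in range(40))
    return correct / 40


def toy_mutate(source: str, rng: random.Random) -> str:
    """Stand-in for an LLM mutation operator: nudge the threshold in the program."""
    def bump(match: re.Match) -> str:
        return f"< {int(match.group(1)) + rng.choice([-3, -1, 1, 3])}"
    return re.sub(r"< (-?\d+)", bump, source, count=1)


def evolve(mutate, generations: int = 200, seed: int = 0) -> str:
    """(1+1)-style search: keep a mutated program only if it scores strictly better."""
    rng = random.Random(seed)
    best, best_score = SEED_PROGRAM, evaluate(SEED_PROGRAM)
    for _ in range(generations):
        child = mutate(best, rng)
        score = evaluate(child)
        if score > best_score:
            best, best_score = child, score
    return best


best_program = evolve(toy_mutate)
print(best_program)
print("score:", evaluate(best_program))
```

Because the result is ordinary source code, the final policy runs without any further LLM calls, which is the main appeal of code policies over querying the LLM at every step.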
In-Context RL
Large LLMs have shown a surprising property known as In-Context Learning or ICL, in which they can be taught to do function approximation just by being given
Sources
- Murphy, K. (2025). Reinforcement Learning: An Overview. Chapter 6.