Diffusion Models

Like normalizing flows, diffusion models are probabilistic models that define a nonlinear mapping from latent variables to the observed data where both quantities have the same dimension. Like variational autoencoders, they approximate the data likelihood using a lower bound based on an encoder that maps to the latent variable. However, in diffusion models, this encoder is predetermined; the goal is to learn a decoder that is the inverse of this process and can be used to produce samples. Diffusion models are easy to train and can produce very high-quality samples.

Overview

A diffusion model consists of an encoder and a decoder. The encoder takes a data sample $\mathbf{x}$ and maps it through a series of intermediate latent variables $\mathbf{z}_1, \ldots, \mathbf{z}_T$. The decoder reverses this process; it starts with $\mathbf{z}_T$ and maps back through $\mathbf{z}_{T-1}, \ldots, \mathbf{z}_1$ until it finally recreates a data point $\mathbf{x}$. In both encoder and decoder, the mappings are stochastic rather than deterministic.

The encoder is prespecified; it gradually blends the input with samples of white noise. With enough steps, the conditional distribution $q(\mathbf{z}_T \mid \mathbf{x})$ and marginal distribution $q(\mathbf{z}_T)$ of the final latent variable both become the standard normal distribution. Since this process is prespecified, all the learned parameters are in the decoder.

In the decoder, a series of networks are trained to map backward between each adjacent pair of latent variables $\mathbf{z}_t$ and $\mathbf{z}_{t-1}$. The loss function encourages each network to invert the corresponding encoder step. The result is that noise is gradually removed from the representation until a realistic-looking data example remains. To generate a new data example $\mathbf{x}$, we draw a sample from $q(\mathbf{z}_T)$ and pass it through the decoder.

Encoder (Forward Process)

Definition (Forward Process)

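The diffusion or forward process maps a data example $\mathbf{x}$ through a series of intermediate variables $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_T$ with the same size as $\mathbf{x}$ according to

$$
\begin{aligned}
\mathbf{z}_1 &= \sqrt{1-\beta_1}\, \mathbf{x} + \sqrt{\beta_1}\, \varepsilon_1 \\
\mathbf{z}_t &= \sqrt{1-\beta_t}\, \mathbf{z}_{t-1} + \sqrt{\beta_t}\, \varepsilon_t \qquad \forall\, t \in 2, \ldots, T,
\end{aligned}
$$

where $\varepsilon_t$ is noise drawn from a standard normal distribution. The first term attenuates the data plus any noise added so far, and the second term adds more noise. The hyperparameters $\beta_t \in [0, 1]$ determine how quickly the noise is blended and are collectively known as the noise schedule.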
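The forward process can be simulated one step at a time. The following is a minimal sketch, assuming a linear noise schedule; the function names, the schedule endpoints, and the three-element example are illustrative assumptions rather than part of the definition.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Hypothetical linear noise schedule: beta_1 ... beta_T evenly spaced."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + step * t for t in range(num_steps)]

def forward_step(z_prev, beta, rng):
    """One encoder step: z_t = sqrt(1 - beta_t) * z_{t-1} + sqrt(beta_t) * eps_t."""
    return [math.sqrt(1.0 - beta) * z + math.sqrt(beta) * rng.gauss(0.0, 1.0)
            for z in z_prev]

def forward_chain(x, betas, rng):
    """Return [z_1, ..., z_T] obtained by applying the encoder steps in sequence."""
    chain, z = [], list(x)
    for beta in betas:
        z = forward_step(z, beta, rng)
        chain.append(z)
    return chain

rng = random.Random(0)
betas = linear_beta_schedule(1000)
chain = forward_chain([2.0, -1.0, 0.5], betas, rng)
print(len(chain), chain[0], chain[-1])
```

Early entries of `chain` stay close to the input, while late entries look like draws from a standard normal distribution.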

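The forward process can be equivalently written as

$$
\begin{aligned}
q(\mathbf{z}_1 \mid \mathbf{x}) &= \mathrm{Norm}_{\mathbf{z}_1}\!\left[ \sqrt{1-\beta_1}\, \mathbf{x},\ \beta_1 \mathbf{I} \right] \\
q(\mathbf{z}_t \mid \mathbf{z}_{t-1}) &= \mathrm{Norm}_{\mathbf{z}_t}\!\left[ \sqrt{1-\beta_t}\, \mathbf{z}_{t-1},\ \beta_t \mathbf{I} \right] \qquad \forall\, t \in 2, \ldots, T.
\end{aligned}
$$

This is a Markov chain because the probability of $\mathbf{z}_t$ is determined entirely by the value of the immediately preceding variable $\mathbf{z}_{t-1}$.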

Tip

With sufficient steps $T$, all traces of the original data are removed, and $q(\mathbf{z}_T \mid \mathbf{x}) = q(\mathbf{z}_T)$ becomes a standard normal distribution.

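The joint distribution of all the latent variables $\mathbf{z}_1, \mathbf{z}_2, \ldots, \mathbf{z}_T$ given input $\mathbf{x}$ is

$$
q(\mathbf{z}_{1 \ldots T} \mid \mathbf{x}) = q(\mathbf{z}_1 \mid \mathbf{x}) \prod_{t=2}^{T} q(\mathbf{z}_t \mid \mathbf{z}_{t-1}).
$$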

Diffusion Kernel

To train the decoder to invert this process, we use multiple samples $\mathbf{z}_t$ at time $t$ for the same example $\mathbf{x}$. However, generating these sequentially using the forward process update is time-consuming when $t$ is large. Fortunately, there is a closed-form expression for $q(\mathbf{z}_t \mid \mathbf{x})$, which allows us to directly draw samples $\mathbf{z}_t$ given initial data point $\mathbf{x}$ without computing the intermediate variables $\mathbf{z}_1, \ldots, \mathbf{z}_{t-1}$.

Definition (Diffusion Kernel)

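The diffusion kernel is given by

$$
\mathbf{z}_t = \sqrt{\alpha_t}\, \mathbf{x} + \sqrt{1-\alpha_t}\, \varepsilon,
$$

where $\alpha_t = \prod_{s=1}^{t} (1-\beta_s)$. We can equivalently write this in probabilistic form:

$$
q(\mathbf{z}_t \mid \mathbf{x}) = \mathrm{Norm}_{\mathbf{z}_t}\!\left[ \sqrt{\alpha_t}\, \mathbf{x},\ (1-\alpha_t) \mathbf{I} \right].
$$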

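To see why, consider the first two steps of the forward process:

$$
\begin{aligned}
\mathbf{z}_1 &= \sqrt{1-\beta_1}\, \mathbf{x} + \sqrt{\beta_1}\, \varepsilon_1 \\
\mathbf{z}_2 &= \sqrt{1-\beta_2}\, \mathbf{z}_1 + \sqrt{\beta_2}\, \varepsilon_2.
\end{aligned}
$$

Substituting the first into the second yields

$$
\begin{aligned}
\mathbf{z}_2 &= \sqrt{1-\beta_2} \left( \sqrt{1-\beta_1}\, \mathbf{x} + \sqrt{\beta_1}\, \varepsilon_1 \right) + \sqrt{\beta_2}\, \varepsilon_2 \\
&= \sqrt{(1-\beta_2)(1-\beta_1)}\, \mathbf{x} + \sqrt{1-\beta_2-(1-\beta_2)(1-\beta_1)}\, \varepsilon_1 + \sqrt{\beta_2}\, \varepsilon_2.
\end{aligned}
$$

The last two terms are independent samples from mean-zero normal distributions with variances $1-\beta_2-(1-\beta_2)(1-\beta_1)$ and $\beta_2$, respectively. The mean of this sum is zero, and its variance is the sum of the component variances, so

$$
\mathbf{z}_2 = \sqrt{(1-\beta_2)(1-\beta_1)}\, \mathbf{x} + \sqrt{1-(1-\beta_2)(1-\beta_1)}\, \varepsilon,
$$

where $\varepsilon$ is also a sample from a standard normal distribution. Repeating the same substitution for $\mathbf{z}_3$ and beyond accumulates one more factor $(1-\beta_s)$ per step, which gives the product $\alpha_t$ in the diffusion kernel.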
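The variance bookkeeping in this argument is easy to check numerically. The sketch below uses illustrative values for $\beta_1$ and $\beta_2$ (an assumption for the example), confirms that the two noise variances add up to $1 - \alpha_2$, and shows how a sample can then be drawn directly from the kernel; the function names are hypothetical.

```python
import math
import random

def cumulative_alphas(betas):
    """alpha_t = prod_{s<=t} (1 - beta_s), returned for t = 1..T."""
    alphas, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas.append(running)
    return alphas

def sample_diffusion_kernel(x, alpha_t, rng):
    """Draw z_t = sqrt(alpha_t) * x + sqrt(1 - alpha_t) * eps in a single step."""
    return [math.sqrt(alpha_t) * xi + math.sqrt(1.0 - alpha_t) * rng.gauss(0.0, 1.0)
            for xi in x]

betas = [0.1, 0.2]                      # illustrative beta_1, beta_2
alphas = cumulative_alphas(betas)

# Variances of the eps_1 and eps_2 terms after substituting z_1 into z_2.
var_eps1 = (1.0 - betas[1]) * betas[0]
var_eps2 = betas[1]
combined_var = var_eps1 + var_eps2

print(combined_var, 1.0 - alphas[1])    # the two values agree up to rounding
print(sample_diffusion_kernel([2.0, -1.0], alphas[1], random.Random(0)))
```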

Marginal Distribution

The marginal distribution $q(\mathbf{z}_t)$ is the probability of observing a value of $\mathbf{z}_t$ given the distribution of possible starting points $\mathbf{x}$ and the possible diffusion paths for each starting point. Since we have an expression for the diffusion kernel $q(\mathbf{z}_t \mid \mathbf{x})$ that skips the intervening variables, we can write

$$
q(\mathbf{z}_t) = \int q(\mathbf{z}_t \mid \mathbf{x})\, \Pr(\mathbf{x})\, d\mathbf{x}.
$$

Hence, if we repeatedly sample from the data distribution $\Pr(\mathbf{x})$ and superimpose the diffusion kernel $q(\mathbf{z}_t \mid \mathbf{x})$ on each sample, the result is the marginal distribution $q(\mathbf{z}_t)$. However, the marginal distribution cannot be written in closed form because we don’t know the original data distribution $\Pr(\mathbf{x})$.
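For intuition, the superposition can be computed exactly when the data distribution is replaced by a small set of points. The sketch below assumes a hypothetical one-dimensional dataset with two points, so the marginal becomes an equally weighted mixture of two diffusion kernels; this is an illustration only, since the true data distribution is unknown.

```python
import math

def normal_pdf(z, mean, var):
    """Density of a scalar normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def marginal_density(z, data_points, alpha_t):
    """q(z_t) for an empirical Pr(x): average the diffusion kernel over the data."""
    kernel_var = 1.0 - alpha_t
    return sum(normal_pdf(z, math.sqrt(alpha_t) * x, kernel_var)
               for x in data_points) / len(data_points)

data_points = [-2.0, 2.0]               # hypothetical 1D dataset
for alpha_t in (0.9, 0.5, 1e-6):
    print(alpha_t, marginal_density(0.0, data_points, alpha_t))
```

As $\alpha_t$ shrinks toward zero, the two modes merge and the density at the origin approaches that of a standard normal.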

Conditional Distribution

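We defined the conditional probability $q(\mathbf{z}_t \mid \mathbf{z}_{t-1})$ as the mixing process (Equations 1 and 2). To reverse this process, we apply Bayes' theorem:

$$
q(\mathbf{z}_{t-1} \mid \mathbf{z}_t) = \frac{q(\mathbf{z}_t \mid \mathbf{z}_{t-1})\, q(\mathbf{z}_{t-1})}{q(\mathbf{z}_t)}.
$$

This is intractable since we cannot compute the marginal distribution $q(\mathbf{z}_{t-1})$.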

Conditional Diffusion Distribution

We noted above that we could not find the conditional distribution $q(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$ because we do not know the marginal distribution $q(\mathbf{z}_{t-1})$. However, if we know the starting variable $\mathbf{x}$, then we do know the distribution $q(\mathbf{z}_{t-1} \mid \mathbf{x})$ at the time before. This is just the diffusion kernel, and it is normally distributed.

Hence, it is possible to compute the conditional diffusion distribution $q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{x})$ in closed form. This distribution is used to train the decoder. It is the distribution over $\mathbf{z}_{t-1}$ when we know the current latent variable $\mathbf{z}_t$ and the training data example $\mathbf{x}$ (which we do when training).

Definition (Conditional Diffusion Distribution)

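The conditional diffusion distribution is given by

$$
q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{x}) = \mathrm{Norm}_{\mathbf{z}_{t-1}}\!\left[ \frac{1-\alpha_{t-1}}{1-\alpha_t} \sqrt{1-\beta_t}\, \mathbf{z}_t + \frac{\sqrt{\alpha_{t-1}}\, \beta_t}{1-\alpha_t}\, \mathbf{x},\ \frac{\beta_t (1-\alpha_{t-1})}{1-\alpha_t} \mathbf{I} \right].
$$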
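The closed form can be checked against Bayes' rule in one dimension, since $q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{x}) \propto q(\mathbf{z}_t \mid \mathbf{z}_{t-1})\, q(\mathbf{z}_{t-1} \mid \mathbf{x})$. The following sketch uses the standard mean $\frac{1-\alpha_{t-1}}{1-\alpha_t}\sqrt{1-\beta_t}\, z_t + \frac{\sqrt{\alpha_{t-1}}\,\beta_t}{1-\alpha_t}\, x$ and variance $\frac{\beta_t(1-\alpha_{t-1})}{1-\alpha_t}$; all numeric values and function names are illustrative assumptions.

```python
import math

def normal_pdf(z, mean, var):
    """Density of a scalar normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def conditional_diffusion_params(z_t, x, beta_t, alpha_t, alpha_prev):
    """Mean and variance of q(z_{t-1} | z_t, x) for scalar z_t and x."""
    mean = ((1.0 - alpha_prev) / (1.0 - alpha_t) * math.sqrt(1.0 - beta_t) * z_t
            + math.sqrt(alpha_prev) * beta_t / (1.0 - alpha_t) * x)
    var = beta_t * (1.0 - alpha_prev) / (1.0 - alpha_t)
    return mean, var

def unnormalized_posterior(z_prev, z_t, x, beta_t, alpha_prev):
    """Bayes' rule numerator: q(z_t | z_{t-1}) * q(z_{t-1} | x)."""
    likelihood = normal_pdf(z_t, math.sqrt(1.0 - beta_t) * z_prev, beta_t)
    prior = normal_pdf(z_prev, math.sqrt(alpha_prev) * x, 1.0 - alpha_prev)
    return likelihood * prior

beta_t, alpha_prev = 0.2, 0.7           # illustrative beta_t and alpha_{t-1}
alpha_t = alpha_prev * (1.0 - beta_t)
z_t, x = 0.4, 1.5
mean, var = conditional_diffusion_params(z_t, x, beta_t, alpha_t, alpha_prev)

# Density ratios between two candidate values of z_{t-1} must agree,
# because the two expressions differ only by a normalizing constant.
a, b = 0.3, 1.1
ratio_closed_form = normal_pdf(a, mean, var) / normal_pdf(b, mean, var)
ratio_bayes = (unnormalized_posterior(a, z_t, x, beta_t, alpha_prev)
               / unnormalized_posterior(b, z_t, x, beta_t, alpha_prev))
print(mean, var, ratio_closed_form, ratio_bayes)
```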

Decoder Model (Reverse Process)

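When we learn a diffusion model, we learn the reverse process. In other words, we learn a series of probabilistic mappings back from latent variable $\mathbf{z}_T$ until we reach the data $\mathbf{x}$. The true reverse distributions $q(\mathbf{z}_{t-1} \mid \mathbf{z}_t)$ of the diffusion process are complex multi-modal distributions that depend on the data distribution $\Pr(\mathbf{x})$. We approximate these as normal distributions

$$
\begin{aligned}
\Pr(\mathbf{z}_T) &= \mathrm{Norm}_{\mathbf{z}_T}\!\left[ \mathbf{0},\ \mathbf{I} \right] \\
\Pr(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \boldsymbol{\phi}_t) &= \mathrm{Norm}_{\mathbf{z}_{t-1}}\!\left[ \mathbf{f}_t[\mathbf{z}_t, \boldsymbol{\phi}_t],\ \sigma_t^2 \mathbf{I} \right] \\
\Pr(\mathbf{x} \mid \mathbf{z}_1, \boldsymbol{\phi}_1) &= \mathrm{Norm}_{\mathbf{x}}\!\left[ \mathbf{f}_1[\mathbf{z}_1, \boldsymbol{\phi}_1],\ \sigma_1^2 \mathbf{I} \right],
\end{aligned}
$$

where $\mathbf{f}_t[\mathbf{z}_t, \boldsymbol{\phi}_t]$ is a neural network that computes the mean of the normal distribution in the estimated mapping from $\mathbf{z}_t$ to the preceding latent variable $\mathbf{z}_{t-1}$. The terms $\{\sigma_t^2\}$ are predetermined. If the hyperparameters $\beta_t$ in the diffusion process are close to zero (and the number of time steps $T$ is large), then this normal approximation will be reasonable.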

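We generate new examples from $\Pr(\mathbf{x})$ using ancestral sampling. We start by drawing $\mathbf{z}_T$ from $\Pr(\mathbf{z}_T)$. Then we sample $\mathbf{z}_{T-1}$ from $\Pr(\mathbf{z}_{T-1} \mid \mathbf{z}_T, \boldsymbol{\phi}_T)$, and so on until we finally generate $\mathbf{x}$ from $\Pr(\mathbf{x} \mid \mathbf{z}_1, \boldsymbol{\phi}_1)$.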

Training

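The joint distribution of the observed variable $\mathbf{x}$ and the latent variables $\{\mathbf{z}_t\}$ is

$$
\Pr(\mathbf{x}, \mathbf{z}_{1 \ldots T} \mid \boldsymbol{\phi}_{1 \ldots T}) = \Pr(\mathbf{x} \mid \mathbf{z}_1, \boldsymbol{\phi}_1) \cdot \prod_{t=2}^{T} \Pr(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \boldsymbol{\phi}_t) \cdot \Pr(\mathbf{z}_T).
$$

The likelihood of the observed data $\Pr(\mathbf{x} \mid \boldsymbol{\phi}_{1 \ldots T})$ is found by marginalizing over the latent variables:

$$
\Pr(\mathbf{x} \mid \boldsymbol{\phi}_{1 \ldots T}) = \int \Pr(\mathbf{x}, \mathbf{z}_{1 \ldots T} \mid \boldsymbol{\phi}_{1 \ldots T})\, d\mathbf{z}_{1 \ldots T}.
$$

To train the model, we maximize the log-likelihood of the training data $\{\mathbf{x}_i\}$ with respect to the parameters $\boldsymbol{\phi}$:

$$
\hat{\boldsymbol{\phi}}_{1 \ldots T} = \underset{\boldsymbol{\phi}_{1 \ldots T}}{\operatorname{argmax}} \left[ \sum_{i=1}^{I} \log \Pr(\mathbf{x}_i \mid \boldsymbol{\phi}_{1 \ldots T}) \right].
$$

We can’t maximize this directly because the marginalization in Equation 13 is intractable. Hence, we use Jensen’s inequality to define a lower bound on the likelihood and optimize the parameters $\boldsymbol{\phi}_{1 \ldots T}$ with respect to this bound exactly as we did for the VAE.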

Evidence Lower Bound (ELBO)

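To derive the lower bound, we multiply and divide the integrand inside the log-likelihood by the encoder distribution $q(\mathbf{z}_{1 \ldots T} \mid \mathbf{x})$ and apply Jensen’s inequality:

$$
\begin{aligned}
\log \Pr(\mathbf{x} \mid \boldsymbol{\phi}_{1 \ldots T}) &= \log \left[ \int q(\mathbf{z}_{1 \ldots T} \mid \mathbf{x})\, \frac{\Pr(\mathbf{x}, \mathbf{z}_{1 \ldots T} \mid \boldsymbol{\phi}_{1 \ldots T})}{q(\mathbf{z}_{1 \ldots T} \mid \mathbf{x})}\, d\mathbf{z}_{1 \ldots T} \right] \\
&\geq \int q(\mathbf{z}_{1 \ldots T} \mid \mathbf{x}) \log \left[ \frac{\Pr(\mathbf{x}, \mathbf{z}_{1 \ldots T} \mid \boldsymbol{\phi}_{1 \ldots T})}{q(\mathbf{z}_{1 \ldots T} \mid \mathbf{x})} \right] d\mathbf{z}_{1 \ldots T}.
\end{aligned}
$$

This gives us the evidence lower bound (ELBO)

$$
\mathrm{ELBO}[\boldsymbol{\phi}_{1 \ldots T}] = \int q(\mathbf{z}_{1 \ldots T} \mid \mathbf{x}) \log \left[ \frac{\Pr(\mathbf{x}, \mathbf{z}_{1 \ldots T} \mid \boldsymbol{\phi}_{1 \ldots T})}{q(\mathbf{z}_{1 \ldots T} \mid \mathbf{x})} \right] d\mathbf{z}_{1 \ldots T}.
$$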

Tip

The key idea is to ask what distribution the expectation should be over, rather than what distribution we should multiply and divide by. Since $\mathbf{z}_{1 \ldots T}$ are latent variables and $\mathbf{x}$ is observed, the natural choice is a distribution over latent explanations conditioned on that observation, i.e. something of the form $q(\mathbf{z}_{1 \ldots T} \mid \mathbf{x})$.

In the VAE, the encoder approximates the posterior distribution over the latent variables to make the bound tight, and the decoder maximizes this bound. The ideal choice would be the true posterior $\Pr(\mathbf{z}_{1 \ldots T} \mid \mathbf{x}, \boldsymbol{\phi}_{1 \ldots T})$, because then the bound would hold with equality, but that quantity is intractable. In diffusion models, the decoder must do all the work since the encoder has no parameters. It makes the bound tighter by both (i) changing its parameters so that the static encoder does approximate the posterior and (ii) optimizing its own parameters with respect to that bound.
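The relationship between the bound and the posterior can be seen in a toy model with a single binary latent variable. This is a minimal sketch with made-up joint probabilities (an assumption for illustration, not a diffusion model): an arbitrary choice of $q$ gives a strict lower bound, and the true posterior makes the bound exact.

```python
import math

def elbo(joint, q):
    """Sum_z q(z) * log(Pr(x, z) / q(z)) for a discrete latent variable z."""
    return sum(qz * math.log(pz / qz) for pz, qz in zip(joint, q) if qz > 0.0)

joint = [0.10, 0.30]                    # hypothetical Pr(x, z) for z = 0 and z = 1
log_likelihood = math.log(sum(joint))   # log Pr(x), marginalizing over z
posterior = [p / sum(joint) for p in joint]

loose = elbo(joint, [0.5, 0.5])         # arbitrary q: strictly below log Pr(x)
tight = elbo(joint, posterior)          # q equal to the posterior: bound is exact
print(log_likelihood, loose, tight)
```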

Simplified ELBO

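It can be shown that a simplified approximation to Equation 22 is given by

$$
\mathrm{ELBO}[\boldsymbol{\phi}_{1 \ldots T}] \approx \mathbb{E}_{q(\mathbf{z}_1 \mid \mathbf{x})} \Big[ \log \Pr(\mathbf{x} \mid \mathbf{z}_1, \boldsymbol{\phi}_1) \Big] - \sum_{t=2}^{T} \mathbb{E}_{q(\mathbf{z}_t \mid \mathbf{x})} \Big[ D_{KL} \big[ q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{x}) \,\big\Vert\, \Pr(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \boldsymbol{\phi}_t) \big] \Big].
$$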

Tip

This form is useful because it decomposes the global ELBO into local denoising objectives. Instead of reasoning about the entire latent trajectory at once, each term trains one reverse transition $\Pr(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \boldsymbol{\phi}_t)$ to match the tractable conditional diffusion distribution $q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{x})$. Since both distributions are Gaussian, the KL divergence can be computed directly. The expectation in the second term is over noisy samples $\mathbf{z}_t \sim q(\mathbf{z}_t \mid \mathbf{x})$; the KL divergence itself integrates over possible values of $\mathbf{z}_{t-1}$. In practice, this means we can sample a timestep $t$, generate $\mathbf{z}_t$ directly from the diffusion kernel, and train the model to remove the corresponding amount of noise.

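The first term was previously defined in Equation 17 as

$$
\Pr(\mathbf{x} \mid \mathbf{z}_1, \boldsymbol{\phi}_1) = \mathrm{Norm}_{\mathbf{x}}\!\left[ \mathbf{f}_1[\mathbf{z}_1, \boldsymbol{\phi}_1],\ \sigma_1^2 \mathbf{I} \right],
$$

and this is equivalent to the reconstruction term in the VAE. The ELBO will be larger if the model prediction matches the observed data. As for the VAE, we will approximate the expectation over the log of this quantity using a Monte Carlo estimate, in which we estimate the expectation with a sample from $q(\mathbf{z}_1 \mid \mathbf{x})$.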

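The KL divergence terms in the ELBO measure the distance between $\Pr(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \boldsymbol{\phi}_t)$ and $q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{x})$, which were defined in Equations 16 and 14, respectively (the latter follows from the definition of the diffusion kernel):

$$
\begin{aligned}
\Pr(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \boldsymbol{\phi}_t) &= \mathrm{Norm}_{\mathbf{z}_{t-1}}\!\left[ \mathbf{f}_t[\mathbf{z}_t, \boldsymbol{\phi}_t],\ \sigma_t^2 \mathbf{I} \right] \\
q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{x}) &= \mathrm{Norm}_{\mathbf{z}_{t-1}}\!\left[ \frac{1-\alpha_{t-1}}{1-\alpha_t} \sqrt{1-\beta_t}\, \mathbf{z}_t + \frac{\sqrt{\alpha_{t-1}}\, \beta_t}{1-\alpha_t}\, \mathbf{x},\ \frac{\beta_t (1-\alpha_{t-1})}{1-\alpha_t} \mathbf{I} \right].
\end{aligned}
$$

The KL divergence between two normal distributions has a closed-form expression. Moreover, many of the terms in this expression do not depend on $\boldsymbol{\phi}$, and the expression simplifies to the squared difference between the means plus a constant $C$:

$$
D_{KL} \big[ q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{x}) \,\big\Vert\, \Pr(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \boldsymbol{\phi}_t) \big] = \frac{1}{2\sigma_t^2} \left\Vert \frac{1-\alpha_{t-1}}{1-\alpha_t} \sqrt{1-\beta_t}\, \mathbf{z}_t + \frac{\sqrt{\alpha_{t-1}}\, \beta_t}{1-\alpha_t}\, \mathbf{x} - \mathbf{f}_t[\mathbf{z}_t, \boldsymbol{\phi}_t] \right\Vert^2 + C.
$$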
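The claim that only the squared difference between the means depends on the model can be checked with the closed-form KL divergence between two isotropic normal distributions. In this sketch the means and variances are illustrative assumptions; changing the model mean changes the KL divergence by exactly the change in the scaled squared-error term.

```python
import math

def kl_isotropic_normals(mean_q, var_q, mean_p, var_p):
    """D_KL[ N(mean_q, var_q I) || N(mean_p, var_p I) ] for means given as lists."""
    dim = len(mean_q)
    sq_dist = sum((a - b) ** 2 for a, b in zip(mean_q, mean_p))
    return 0.5 * (dim * (var_q / var_p - 1.0 + math.log(var_p / var_q))
                  + sq_dist / var_p)

def squared_error_term(mean_q, mean_p, var_p):
    """The only part of the KL divergence that depends on the model mean."""
    return sum((a - b) ** 2 for a, b in zip(mean_q, mean_p)) / (2.0 * var_p)

mean_q, var_q = [0.8, -0.2], 0.05       # stands in for q(z_{t-1} | z_t, x)
var_p = 0.10                            # predetermined decoder variance sigma_t^2
pred_a, pred_b = [0.5, 0.0], [0.7, -0.1]   # two candidate model means f_t

kl_gap = (kl_isotropic_normals(mean_q, var_q, pred_a, var_p)
          - kl_isotropic_normals(mean_q, var_q, pred_b, var_p))
se_gap = (squared_error_term(mean_q, pred_a, var_p)
          - squared_error_term(mean_q, pred_b, var_p))
print(kl_gap, se_gap)                   # equal up to rounding
```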

Diffusion Loss Function

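To fit the model, we maximize the ELBO with respect to the parameters $\boldsymbol{\phi}_{1 \ldots T}$. We recast this as a minimization by multiplying by minus one and approximating the expectations with samples to give the loss function:

$$
L[\boldsymbol{\phi}_{1 \ldots T}] = \sum_{i=1}^{I} \left( -\log \mathrm{Norm}_{\mathbf{x}_i}\!\left[ \mathbf{f}_1[\mathbf{z}_{i,1}, \boldsymbol{\phi}_1],\ \sigma_1^2 \mathbf{I} \right] + \sum_{t=2}^{T} \frac{1}{2\sigma_t^2} \left\Vert \frac{1-\alpha_{t-1}}{1-\alpha_t} \sqrt{1-\beta_t}\, \mathbf{z}_{i,t} + \frac{\sqrt{\alpha_{t-1}}\, \beta_t}{1-\alpha_t}\, \mathbf{x}_i - \mathbf{f}_t[\mathbf{z}_{i,t}, \boldsymbol{\phi}_t] \right\Vert^2 \right),
$$

where $\mathrm{Norm}_{\mathbf{x}}[\boldsymbol{\mu}, \boldsymbol{\Sigma}]$ denotes the value of the normal density with mean $\boldsymbol{\mu}$ and covariance $\boldsymbol{\Sigma}$ evaluated at $\mathbf{x}$, $\mathbf{x}_i$ is the $i$th data point, and $\mathbf{z}_{i,t}$ is the associated latent variable at diffusion step $t$.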

Training Procedure

This loss function can be used to train a network for each diffusion step. It minimizes the difference between the estimate $\mathbf{f}_t[\mathbf{z}_t, \boldsymbol{\phi}_t]$ of the hidden variable at the previous time step and the most likely value that it took given the ground truth de-noised data $\mathbf{x}$.

Reparameterization of the Loss Function

Although the loss function in Equation 28 can be used, diffusion models have been found to work better with a different parameterization; the loss function is modified so that the model aims to predict the noise that was mixed with the original data example to create the current variable.

Reparameterization of the Target

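The original diffusion update was given by

$$
\mathbf{z}_t = \sqrt{\alpha_t}\, \mathbf{x} + \sqrt{1-\alpha_t}\, \varepsilon.
$$

It follows that the data term $\mathbf{x}$ in Equation 27 can be expressed as the diffused image minus the noise that was added to it:

$$
\mathbf{x} = \frac{1}{\sqrt{\alpha_t}}\, \mathbf{z}_t - \frac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\, \varepsilon.
$$

Substituting this into the target terms from Equation 28 gives

$$
\begin{aligned}
\frac{1-\alpha_{t-1}}{1-\alpha_t} \sqrt{1-\beta_t}\, \mathbf{z}_t + \frac{\sqrt{\alpha_{t-1}}\, \beta_t}{1-\alpha_t}\, \mathbf{x}
&= \frac{1-\alpha_{t-1}}{1-\alpha_t} \sqrt{1-\beta_t}\, \mathbf{z}_t + \frac{\sqrt{\alpha_{t-1}}\, \beta_t}{1-\alpha_t} \left( \frac{1}{\sqrt{\alpha_t}}\, \mathbf{z}_t - \frac{\sqrt{1-\alpha_t}}{\sqrt{\alpha_t}}\, \varepsilon \right) \\
&= \left( \frac{(1-\alpha_{t-1})(1-\beta_t) + \beta_t}{(1-\alpha_t) \sqrt{1-\beta_t}} \right) \mathbf{z}_t - \frac{\beta_t}{\sqrt{1-\alpha_t} \sqrt{1-\beta_t}}\, \varepsilon \\
&= \frac{1}{\sqrt{1-\beta_t}}\, \mathbf{z}_t - \frac{\beta_t}{\sqrt{1-\alpha_t} \sqrt{1-\beta_t}}\, \varepsilon,
\end{aligned}
$$

where the second line uses $\sqrt{\alpha_{t-1}} / \sqrt{\alpha_t} = 1 / \sqrt{1-\beta_t}$ and the third uses $(1-\alpha_{t-1})(1-\beta_t) + \beta_t = 1 - \alpha_t$.

Substituting back into the loss function gives

$$
L[\boldsymbol{\phi}_{1 \ldots T}] = \sum_{i=1}^{I} \left( -\log \mathrm{Norm}_{\mathbf{x}_i}\!\left[ \mathbf{f}_1[\mathbf{z}_{i,1}, \boldsymbol{\phi}_1],\ \sigma_1^2 \mathbf{I} \right] + \sum_{t=2}^{T} \frac{1}{2\sigma_t^2} \left\Vert \left( \frac{1}{\sqrt{1-\beta_t}}\, \mathbf{z}_{i,t} - \frac{\beta_t}{\sqrt{1-\alpha_t} \sqrt{1-\beta_t}}\, \varepsilon_{i,t} \right) - \mathbf{f}_t[\mathbf{z}_{i,t}, \boldsymbol{\phi}_t] \right\Vert^2 \right).
$$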
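The algebra behind this substitution can be sanity-checked numerically in one dimension. The sketch below compares the target written in terms of the data $x$, $\frac{1-\alpha_{t-1}}{1-\alpha_t}\sqrt{1-\beta_t}\, z_t + \frac{\sqrt{\alpha_{t-1}}\,\beta_t}{1-\alpha_t}\, x$, with the same target written in terms of the noise, $\frac{1}{\sqrt{1-\beta_t}}\, z_t - \frac{\beta_t}{\sqrt{1-\alpha_t}\sqrt{1-\beta_t}}\, \varepsilon$; the numeric values and function names are illustrative assumptions.

```python
import math

def target_from_data(z_t, x, beta_t, alpha_t, alpha_prev):
    """Mean of q(z_{t-1} | z_t, x), written in terms of the clean data x."""
    return ((1.0 - alpha_prev) / (1.0 - alpha_t) * math.sqrt(1.0 - beta_t) * z_t
            + math.sqrt(alpha_prev) * beta_t / (1.0 - alpha_t) * x)

def target_from_noise(z_t, eps, beta_t, alpha_t):
    """The same mean, written in terms of the noise eps that produced z_t."""
    scale = beta_t / (math.sqrt(1.0 - alpha_t) * math.sqrt(1.0 - beta_t))
    return z_t / math.sqrt(1.0 - beta_t) - scale * eps

beta_t, alpha_prev = 0.2, 0.7           # illustrative beta_t and alpha_{t-1}
alpha_t = alpha_prev * (1.0 - beta_t)
x, eps = 1.5, -0.3
z_t = math.sqrt(alpha_t) * x + math.sqrt(1.0 - alpha_t) * eps   # diffusion kernel

from_data = target_from_data(z_t, x, beta_t, alpha_t, alpha_prev)
from_noise = target_from_noise(z_t, eps, beta_t, alpha_t)
print(from_data, from_noise)            # equal up to rounding
```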

Reparameterization of the Target Network

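We now replace the model $\hat{\mathbf{z}}_{t-1} = \mathbf{f}_t[\mathbf{z}_t, \boldsymbol{\phi}_t]$ with a new model $\hat{\varepsilon} = \mathbf{g}_t[\mathbf{z}_t, \boldsymbol{\phi}_t]$, which predicts the noise $\varepsilon$ that was mixed with $\mathbf{x}$ to create $\mathbf{z}_t$:

$$
\mathbf{f}_t[\mathbf{z}_t, \boldsymbol{\phi}_t] = \frac{1}{\sqrt{1-\beta_t}}\, \mathbf{z}_t - \frac{\beta_t}{\sqrt{1-\alpha_t} \sqrt{1-\beta_t}}\, \mathbf{g}_t[\mathbf{z}_t, \boldsymbol{\phi}_t].
$$

Substituting into Equation 32 and simplifying yields

$$
L[\boldsymbol{\phi}_{1 \ldots T}] = \sum_{i=1}^{I} \left( -\log \mathrm{Norm}_{\mathbf{x}_i}\!\left[ \mathbf{f}_1[\mathbf{z}_{i,1}, \boldsymbol{\phi}_1],\ \sigma_1^2 \mathbf{I} \right] + \sum_{t=2}^{T} \frac{\beta_t^2}{(1-\alpha_t)(1-\beta_t)\, 2\sigma_t^2} \Big\Vert \mathbf{g}_t[\mathbf{z}_{i,t}, \boldsymbol{\phi}_t] - \varepsilon_{i,t} \Big\Vert^2 \right).
$$

Rewriting the log normal as a least squares loss plus a constant $C_i$ gives

$$
L[\boldsymbol{\phi}_{1 \ldots T}] = \sum_{i=1}^{I} \left( \frac{1}{2\sigma_1^2} \Big\Vert \mathbf{x}_i - \mathbf{f}_1[\mathbf{z}_{i,1}, \boldsymbol{\phi}_1] \Big\Vert^2 + \sum_{t=2}^{T} \frac{\beta_t^2}{(1-\alpha_t)(1-\beta_t)\, 2\sigma_t^2} \Big\Vert \mathbf{g}_t[\mathbf{z}_{i,t}, \boldsymbol{\phi}_t] - \varepsilon_{i,t} \Big\Vert^2 + C_i \right).
$$

Substituting the definition of $\mathbf{x}$ and $\mathbf{f}_1[\mathbf{z}_1, \boldsymbol{\phi}_1]$ from Equations 30 and 33, simplifying, and dropping $C_i$ gives

$$
L[\boldsymbol{\phi}_{1 \ldots T}] = \sum_{i=1}^{I} \sum_{t=1}^{T} \frac{\beta_t^2}{(1-\alpha_t)(1-\beta_t)\, 2\sigma_t^2} \Big\Vert \mathbf{g}_t[\mathbf{z}_{i,t}, \boldsymbol{\phi}_t] - \varepsilon_{i,t} \Big\Vert^2.
$$

In practice, the scaling factors (which might be different at each time step) are ignored, giving an even simpler formulation

$$
\begin{aligned}
L[\boldsymbol{\phi}_{1 \ldots T}] &= \sum_{i=1}^{I} \sum_{t=1}^{T} \Big\Vert \mathbf{g}_t[\mathbf{z}_{i,t}, \boldsymbol{\phi}_t] - \varepsilon_{i,t} \Big\Vert^2 \\
&= \sum_{i=1}^{I} \sum_{t=1}^{T} \Big\Vert \mathbf{g}_t \Big[ \sqrt{\alpha_t}\, \mathbf{x}_i + \sqrt{1-\alpha_t}\, \varepsilon_{i,t},\ \boldsymbol{\phi}_t \Big] - \varepsilon_{i,t} \Big\Vert^2.
\end{aligned}
$$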
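To get a feel for what is discarded, the per-step scaling factor $\beta_t^2 / \big((1-\alpha_t)(1-\beta_t)\, 2\sigma_t^2\big)$ can be tabulated for a concrete schedule. The sketch below assumes a linear schedule and the choice $\sigma_t^2 = \beta_t$; both are assumptions for illustration, since the text only states that the variances are predetermined.

```python
def loss_weights(betas, sigmas_sq):
    """Scaling factor beta_t^2 / ((1 - alpha_t)(1 - beta_t) 2 sigma_t^2) for t = 1..T."""
    weights, alpha = [], 1.0
    for beta, sigma_sq in zip(betas, sigmas_sq):
        alpha *= 1.0 - beta             # running product alpha_t
        weights.append(beta ** 2 / ((1.0 - alpha) * (1.0 - beta) * 2.0 * sigma_sq))
    return weights

T = 1000
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]   # assumed linear schedule
weights = loss_weights(betas, sigmas_sq=betas)                   # assumption: sigma_t^2 = beta_t
print(weights[0], weights[T // 2], weights[-1])
```

Under these assumptions the factor is largest for the earliest steps, so dropping it changes the relative emphasis of the time steps rather than the minimizer of each individual term.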

Implementation

This leads to a straightforward algorithm for both training the model and sampling.

Algorithm 1 Diffusion Model Training

Input: Training data $\mathbf{x}$

Output: Model parameters $\boldsymbol{\phi}_t$

repeat

for $i \in \mathcal{B}$ do

$t \sim \mathrm{Uniform}(1, T)$

$\varepsilon_{i,t} \sim \mathcal{N}(0, \mathbf{I})$

$\ell_i = \Big\Vert \mathbf{g}_t \Big[ \sqrt{\alpha_t}\, \mathbf{x}_i + \sqrt{1-\alpha_t}\, \varepsilon_{i,t},\ \boldsymbol{\phi}_t \Big] - \varepsilon_{i,t} \Big\Vert^2$

end for

Accumulate losses for batch and take gradient step

until converged

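import torch
import torch.nn.functional as F

T = 300

def linear_beta_schedule(timesteps, start=1e-4, end=0.02):
    # Assumed helper (not shown in the original): linearly spaced beta_1 ... beta_T.
    return torch.linspace(start, end, timesteps)

def get_index_from_list(vals, t, x_shape):
    # Assumed helper: pick vals[t] per example and reshape so it broadcasts over x.
    out = vals.gather(-1, t.cpu())
    return out.reshape(t.shape[0], *((1,) * (len(x_shape) - 1))).to(t.device)

betas = linear_beta_schedule(timesteps=T)
# Here `alphas` holds the cumulative products alpha_t = prod_s (1 - beta_s).
alphas = torch.cumprod(1.0 - betas, dim=0)
alphas_prev = F.pad(alphas[:-1], (1, 0), value=1.0)

def forward_diffusion_sample(x, t):
    # Draw z_t directly from the diffusion kernel and return it with the noise used.
    noise = torch.randn_like(x)
    alphas_t = get_index_from_list(alphas, t, x.shape)
    return torch.sqrt(alphas_t) * x + torch.sqrt(1.0 - alphas_t) * noise, noise  # Eq. 6 / Eq. 29

# model, optimizer, dataloader, device, and epochs are assumed to be defined elsewhere.
for epoch in range(epochs):
    for step, x in enumerate(dataloader):
        x = x.to(device)
        optimizer.zero_grad()
        # Sample one step index per example from {0, ..., T - 1} (index t is step t + 1)
        t = torch.randint(0, T, (x.shape[0],), device=device).long()
        # Get noisy observation and the noise applied by the forward process
        x_noisy, noise = forward_diffusion_sample(x, t)
        # Predict the noise that was added to x to produce x_noisy
        noise_pred = model(x_noisy, t)
        # Squared error matches Algorithm 1 (the original snippet used F.l1_loss)
        loss = F.mse_loss(noise_pred, noise)
        loss.backward()
        optimizer.step()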

Success

The training algorithm has the advantages that it is (i) simple to implement, and (ii) naturally augments the dataset; we can reuse every original datapoint $\mathbf{x}_i$ as many times as we want at each time step with different noise instantiations $\varepsilon$.

Algorithm 2 Sampling

Input: Model $\mathbf{g}_t[\bullet, \boldsymbol{\phi}_t]$

Output: Sample $\mathbf{x}$

$\mathbf{z}_T \sim \mathcal{N}(0, \mathbf{I})$

for $t = T \dots 2$ do

$\hat{\mathbf{z}}_{t-1} = \frac{1}{\sqrt{1-\beta_t}}\, \mathbf{z}_t - \frac{\beta_t}{\sqrt{1-\alpha_t} \sqrt{1-\beta_t}}\, \mathbf{g}_t[\mathbf{z}_t, \boldsymbol{\phi}_t]$

$\varepsilon \sim \mathcal{N}(0, \mathbf{I})$

$\sigma_t = \sqrt{\frac{\beta_t (1-\alpha_{t-1})}{1-\alpha_t}}$, from the variance term in $q(\mathbf{z}_{t-1} \mid \mathbf{z}_t, \mathbf{x})$ (Eq. 14 / Eq. 26)

$\mathbf{z}_{t-1} = \hat{\mathbf{z}}_{t-1} + \sigma_t \varepsilon$

end for

$\mathbf{x} = \frac{1}{\sqrt{1-\beta_1}}\, \mathbf{z}_1 - \frac{\beta_1}{\sqrt{1-\alpha_1} \sqrt{1-\beta_1}}\, \mathbf{g}_1[\mathbf{z}_1, \boldsymbol{\phi}_1]$

Caution

The sampling algorithm has the disadvantage that it requires serial processing of many neural networks and is hence time-consuming.
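Algorithm 2 can be sketched for scalar data as follows. Instead of a trained network, the example plugs in a hypothetical oracle noise predictor for a dataset containing a single point `x0` (an assumption made so that the result is checkable): with that oracle, the final denoising step returns `x0` up to rounding regardless of the random draws.

```python
import math
import random

def alphas_with_zero(betas):
    """Return [alpha_0, alpha_1, ..., alpha_T] with alpha_0 = 1 by convention."""
    alphas, running = [1.0], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alphas.append(running)
    return alphas

def sample(noise_model, betas, rng):
    """Ancestral sampling; noise_model(z_t, t) plays the role of g_t[z_t, phi_t]."""
    T, alphas = len(betas), alphas_with_zero(betas)

    def denoised_mean(z, t):
        beta = betas[t - 1]             # betas is 0-indexed, steps are 1-indexed
        scale = beta / (math.sqrt(1.0 - alphas[t]) * math.sqrt(1.0 - beta))
        return z / math.sqrt(1.0 - beta) - scale * noise_model(z, t)

    z = rng.gauss(0.0, 1.0)             # z_T ~ N(0, 1)
    for t in range(T, 1, -1):           # t = T, ..., 2
        sigma = math.sqrt(betas[t - 1] * (1.0 - alphas[t - 1]) / (1.0 - alphas[t]))
        z = denoised_mean(z, t) + sigma * rng.gauss(0.0, 1.0)
    return denoised_mean(z, 1)          # the final step adds no noise

def make_oracle(x0, betas):
    """Hypothetical exact noise predictor when the dataset is the single point x0."""
    alphas = alphas_with_zero(betas)
    return lambda z, t: (z - math.sqrt(alphas[t]) * x0) / math.sqrt(1.0 - alphas[t])

betas = [1e-4 + (0.02 - 1e-4) * t / 199 for t in range(200)]   # assumed linear schedule
x_sample = sample(make_oracle(3.0, betas), betas, random.Random(0))
print(x_sample)
```

The loop makes the serial cost explicit: each of the $T$ steps needs one call to the noise model, and step $t-1$ cannot start before step $t$ has finished.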

Conditional Generation

If the data has associated labels $c$, these can be exploited to control the generation. One option is to incorporate class information into the main model $\mathbf{g}_t[\mathbf{z}_t, \boldsymbol{\phi}_t, c]$. In practice, this usually takes the form of adding an embedding based on $c$ to the layers of the U-Net in a similar way to how the time step is added. This model is jointly trained on conditional and unconditional objectives by randomly dropping the class information during training.
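The random dropping of class information can be sketched as a small preprocessing step applied to each training example. The null token value, the drop probability, and the function name below are hypothetical choices for illustration; the text does not specify them.

```python
import random

NULL_LABEL = -1   # hypothetical token meaning "no class information"

def maybe_drop_label(label, drop_prob, rng):
    """Replace the class label by NULL_LABEL with probability drop_prob."""
    return NULL_LABEL if rng.random() < drop_prob else label

rng = random.Random(0)
labels = [3, 7, 1, 3, 9, 2, 7, 0]
train_labels = [maybe_drop_label(c, drop_prob=0.1, rng=rng) for c in labels]
print(train_labels)
```

Examples that receive `NULL_LABEL` train the unconditional objective, and the remaining examples train the conditional one, so a single network learns both.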

Implementation Considerations

Several implementation choices have an outsized effect on whether diffusion models train stably and produce coherent samples.

  • Use an architecture that preserves high- and low-resolution information. In image models, a U-Net with skip connections is common because the deeper layers can model global structure while the shallower layers preserve spatial detail.
  • Residual blocks are often helpful. Repeated denoising steps make optimization sensitive to activation scale and gradient flow, so residual connections often improve stability.
  • The normalization layer is not just a minor detail. Different normalizing layers preserve different kinds of information and interact differently with the sampling procedure.
  • Batch normalization can work well when minibatches are large and stable, since it uses batch-level statistics and often makes optimization easier. However, its behavior differs between training and evaluation because sampling uses stored running statistics instead of the current batch.
  • Group normalization is often preferred in diffusion architectures because it does not depend on minibatch composition and behaves more consistently between training and evaluation. In practice, however, it often works best as part of a broader architectural package with residual blocks, timestep conditioning at many layers, and a training recipe tuned for it.
  • Sampling is unusually sensitive to normalization mismatches. The denoiser is applied many times in sequence, so small errors in feature scaling or calibration can accumulate over the reverse process and lead to blurry or washed-out samples.
  • The training and sampling regimes should match as closely as possible. If a model relies strongly on batch-dependent behavior, then switching between training-time and evaluation-time normalization can noticeably change sample quality.
  • The timestep embedding should be injected throughout the network rather than only once at the input. Different layers operate at different spatial scales and benefit from being able to condition on the noise level in their own feature space.
  • Visualization can be misleading if different quantities are compared. The current reverse-chain sample $\mathbf{z}_t$, the predicted previous latent $\hat{\mathbf{z}}_{t-1}$, and the predicted clean image $\hat{\mathbf{x}}$ are related but distinct objects, and they may have very different visual quality early in training.
  • Paper-standard components are not always the easiest way to build a first prototype. A simpler model with a stable training recipe is often better for debugging the diffusion equations, the noising process, and the sampler before moving to a more faithful large-scale architecture.
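The distinction between the quantities mentioned in the visualization point can be made concrete. The sketch below computes a predicted clean example by inverting the diffusion kernel, $\hat{\mathbf{x}} = (\mathbf{z}_t - \sqrt{1-\alpha_t}\, \hat{\varepsilon}) / \sqrt{\alpha_t}$, and a predicted previous latent using the reverse-step mean from the sampling algorithm; the function names and numeric values are illustrative assumptions.

```python
import math

def predict_clean(z_t, eps_hat, alpha_t):
    """x_hat = (z_t - sqrt(1 - alpha_t) * eps_hat) / sqrt(alpha_t), elementwise."""
    return [(z - math.sqrt(1.0 - alpha_t) * e) / math.sqrt(alpha_t)
            for z, e in zip(z_t, eps_hat)]

def predict_previous_mean(z_t, eps_hat, beta_t, alpha_t):
    """z_hat_{t-1}: the reverse-step mean before any sampling noise is added."""
    scale = beta_t / (math.sqrt(1.0 - alpha_t) * math.sqrt(1.0 - beta_t))
    return [z / math.sqrt(1.0 - beta_t) - scale * e for z, e in zip(z_t, eps_hat)]

x, eps = [0.5, -1.0], [0.3, 1.2]        # illustrative clean data and noise
alpha_t, beta_t = 0.4, 0.02
z_t = [math.sqrt(alpha_t) * xi + math.sqrt(1.0 - alpha_t) * ei
       for xi, ei in zip(x, eps)]

x_hat = predict_clean(z_t, eps, alpha_t)             # uses the true noise as eps_hat
z_prev_hat = predict_previous_mean(z_t, eps, beta_t, alpha_t)
print(z_t, z_prev_hat, x_hat)
```

With a perfect noise estimate, `x_hat` matches the clean data while `z_prev_hat` is still a noisy latent, which is why plots of the two should not be compared as if they were the same object.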

Sources

  • Prince, S. (2023). Understanding Deep Learning. Chapter 18.