Expectation Maximization (EM)

The expectation maximization or EM algorithm is an iterative method for finding maximum likelihood estimates in models with latent or missing variables.

Suppose the observed data are and the unobserved latent variables are . The model defines a joint distribution , but the likelihood for the observed data alone is obtained by marginalizing out the latent variables:

in the discrete case, or

in the continuous case.

The difficulty is that this marginal likelihood can be hard to maximize directly, even when the complete-data likelihood has a simpler form. EM addresses this by alternating between:

  1. estimating the distribution of the latent variables given the current parameter values, and
  2. maximizing the expected complete-data log-likelihood under that distribution.

This makes EM especially useful for latent variable models such as mixture models, hidden-variable models, and other settings where direct optimization of the observed-data likelihood is inconvenient.

The EM Algorithm

Let denote the observed data, the latent variables, and let be the current parameter estimate at iteration . EM defines the auxiliary function

That is, is the conditional expectation of the complete-data log-likelihood, where the expectation is taken with respect to the conditional distribution of the latent variables under the current parameter value .

The EM algorithm seeks to increase the observed-data log-likelihood by iteratively applying these two steps:

Expectation Step (E Step)

In the E step, we compute the conditional distribution of the latent variables given the observed data and the current parameter estimate:

Using this conditional distribution, we form

Intuition

If we actually knew the latent variables , then we could work directly with the complete-data log-likelihood . Since we do not know them, the E step replaces the missing information by averaging over all plausible latent values under the current model.

Maximization Step (M Step)

In the M step, we update the parameter by maximizing the function from the E step:

This produces a new parameter value, which is then used in the next E step.

Intuition

The M step pretends that the expected complete-data log-likelihood is the objective of interest and chooses the parameter value that makes the completed data as plausible as possible on average.

Why EM Works

EM is attractive because each iteration is usually easier than maximizing the observed-data likelihood directly. Under mild conditions, each EM iteration does not decrease the observed-data log-likelihood, so the algorithm generates a sequence of parameter values whose likelihoods improve monotonically.

Warning

EM is not guaranteed to find the global maximum of the likelihood. Like many iterative optimization procedures, it can converge to a local maximum or a saddle point, so initialization can matter.

In practice, EM is often used when:

  1. the model has hidden class assignments or other latent structure,
  2. the complete-data likelihood is easy to optimize, but the marginal likelihood is not, and
  3. we want a principled alternative to ad hoc alternating optimization.

Common Examples

Common applications of EM include:

  1. Gaussian mixture models, where the latent variable indicates which mixture component generated each observation,
  2. models with missing data, where the unobserved entries are treated as latent variables, and
  3. hidden-variable probabilistic models more generally.

Worked Example: Gaussian Mixture Model

Consider a two-component mixture model for one-dimensional data . Introduce a latent variable indicating which component generated observation . Let

and suppose that, conditional on ,

where the variance is assumed known.

The observed-data likelihood is

where and denotes the Normal density. This is difficult to maximize directly because of the sum inside the product.

If we knew the latent assignments , the problem would be much easier.

Complete-Data Log-Likelihood

Let

Then the complete-data log-likelihood is

E Step

In the E step, we compute the conditional probabilities

and

Using Bayes’ theorem,

These quantities are called the responsibilities. They are the probabilities, under the current parameter values, that component 1 generated observation .

Intuition

Instead of assigning each point to exactly one cluster, EM performs a soft assignment. If a point lies between the two components, both components can receive partial responsibility for it.

M Step

In the M step, we maximize the expected complete-data log-likelihood by replacing the unknown indicators with their conditional expectations . The updates become

and

Thus, the new mixing weight is the average responsibility of component 1, and each mean is a weighted average of the data, where the weights are the current responsibilities.

Summary of One Iteration

Starting from initial values :

  1. compute the responsibilities and for each observation,
  2. update using the weighted-average formulas above, and
  3. repeat until the parameter values or the log-likelihood stop changing appreciably.

Sources