Expectation Maximization (EM)
The expectation maximization or EM algorithm is an iterative method for finding maximum likelihood estimates in models with latent or missing variables.
Suppose the observed data are
in the discrete case, or
in the continuous case.
The difficulty is that this marginal likelihood can be hard to maximize directly, even when the complete-data likelihood
- estimating the distribution of the latent variables given the current parameter values, and
- maximizing the expected complete-data log-likelihood under that distribution.
This makes EM especially useful for latent variable models such as mixture models, hidden-variable models, and other settings where direct optimization of the observed-data likelihood is inconvenient.
The EM Algorithm
Let
That is,
The EM algorithm seeks to increase the observed-data log-likelihood by iteratively applying these two steps:
Expectation Step (E Step)
In the E step, we compute the conditional distribution of the latent variables given the observed data and the current parameter estimate:
Using this conditional distribution, we form
Intuition
If we actually knew the latent variables
, then we could work directly with the complete-data log-likelihood . Since we do not know them, the E step replaces the missing information by averaging over all plausible latent values under the current model.
Maximization Step (M Step)
In the M step, we update the parameter by maximizing the function from the E step:
This produces a new parameter value, which is then used in the next E step.
Intuition
The M step pretends that the expected complete-data log-likelihood is the objective of interest and chooses the parameter value that makes the completed data as plausible as possible on average.
Why EM Works
EM is attractive because each iteration is usually easier than maximizing the observed-data likelihood directly. Under mild conditions, each EM iteration does not decrease the observed-data log-likelihood, so the algorithm generates a sequence of parameter values whose likelihoods improve monotonically.
Warning
EM is not guaranteed to find the global maximum of the likelihood. Like many iterative optimization procedures, it can converge to a local maximum or a saddle point, so initialization can matter.
In practice, EM is often used when:
- the model has hidden class assignments or other latent structure,
- the complete-data likelihood is easy to optimize, but the marginal likelihood is not, and
- we want a principled alternative to ad hoc alternating optimization.
Common Examples
Common applications of EM include:
- Gaussian mixture models, where the latent variable indicates which mixture component generated each observation,
- models with missing data, where the unobserved entries are treated as latent variables, and
- hidden-variable probabilistic models more generally.
Worked Example: Gaussian Mixture Model
Consider a two-component mixture model for one-dimensional data
and suppose that, conditional on
where the variance
The observed-data likelihood is
where
If we knew the latent assignments
Complete-Data Log-Likelihood
Let
Then the complete-data log-likelihood is
E Step
In the E step, we compute the conditional probabilities
and
Using Bayes’ theorem,
These quantities are called the responsibilities. They are the probabilities, under the current parameter values, that component 1 generated observation
Intuition
Instead of assigning each point to exactly one cluster, EM performs a soft assignment. If a point lies between the two components, both components can receive partial responsibility for it.
M Step
In the M step, we maximize the expected complete-data log-likelihood by replacing the unknown indicators
and
Thus, the new mixing weight is the average responsibility of component 1, and each mean is a weighted average of the data, where the weights are the current responsibilities.
Summary of One Iteration
Starting from initial values
- compute the responsibilities
and for each observation, - update
using the weighted-average formulas above, and - repeat until the parameter values or the log-likelihood stop changing appreciably.
Sources
- McLachlan, G., and Krishnan, T. (2008). The EM Algorithm and Extensions.
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. Chapter 11.
- https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm