Maximum Likelihood

The most common method for estimating parameters in a parametric model is the maximum likelihood method.

Definition (Likelihood function)

Let $X_1, \ldots, X_n$ be IID with PDF $f(x; \theta)$. The likelihood function is defined by

$$\mathcal{L}_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta).$$

The log-likelihood function is defined by

$$\ell_n(\theta) = \log \mathcal{L}_n(\theta) = \sum_{i=1}^{n} \log f(X_i; \theta).$$
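For concreteness, here is a minimal sketch (not from the text above) that evaluates the likelihood and log-likelihood of a Bernoulli model on a grid of parameter values; the sample and the grid are made-up illustrative choices.

```python
import numpy as np

# Hypothetical Bernoulli(p) data; f(x; p) = p**x * (1 - p)**(1 - x)
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

p_grid = np.linspace(0.01, 0.99, 99)

# Likelihood: product over observations of f(X_i; p)
lik = np.array([np.prod(p**x * (1 - p)**(1 - x)) for p in p_grid])

# Log-likelihood: sum over observations of log f(X_i; p)
loglik = np.array([np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)) for p in p_grid])

# Both are maximized at the same point (here the sample mean, 0.75)
print(p_grid[np.argmax(lik)], p_grid[np.argmax(loglik)])
```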

Intuition

The likelihood function is just the joint density of the data, except that we treat it as a function of the parameter $\theta$. Thus, $\mathcal{L}_n : \Theta \to [0, \infty)$. The likelihood function is not a density function in general: it is not true that $\mathcal{L}_n(\theta)$ integrates to 1 (with respect to $\theta$).
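To make the last point concrete, the following numerical check (reusing the same made-up Bernoulli sample as above) integrates $\mathcal{L}_n(p)$ over $p$ and gets a value far from 1; for $s$ successes in $n$ trials the exact value is the Beta function $B(s+1, n-s+1)$.

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])          # hypothetical Bernoulli sample
p = np.linspace(1e-6, 1 - 1e-6, 100_001)         # grid over the parameter space (0, 1)

lik = p**x.sum() * (1 - p)**(len(x) - x.sum())   # L(p) for this sample

# Riemann-sum approximation of the integral of L(p) over (0, 1)
print(np.sum(lik) * (p[1] - p[0]))               # ≈ 0.004 = B(7, 3), nowhere near 1
```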

Definition (MLE)

The maximum likelihood estimator (MLE), denoted by $\hat{\theta}_n$, is the value of $\theta$ that maximizes $\mathcal{L}_n(\theta)$.

Since $\log$ is a monotonic function, the maximum of $\ell_n(\theta)$ occurs at the same place as the maximum of $\mathcal{L}_n(\theta)$, so maximizing the log-likelihood leads to the same answer as maximizing the likelihood. Often, it is easier to work with the log-likelihood.
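As a minimal sketch of maximizing the log-likelihood in practice (assuming an Exponential($\lambda$) model and simulated data, neither of which comes from the text), the numerical maximizer agrees with the closed-form MLE $\hat{\lambda}_n = 1/\bar{X}_n$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.5, size=500)     # simulated data with true rate λ = 2.5

# Log-likelihood of Exponential(λ): sum over i of (log λ - λ X_i)
def negloglik(lam):
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(negloglik, bounds=(1e-6, 50), method="bounded")
print(res.x, 1 / x.mean())                       # numerical maximizer ≈ closed-form MLE
```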

Properties of Maximum Likelihood Estimators

Under certain regularity conditions on the model, the maximum likelihood estimator possesses many properties that make it an appealing choice of estimator. The main properties of the MLE are:

  1. The MLE is consistent: $\hat{\theta}_n \xrightarrow{P} \theta_\star$, where $\theta_\star$ denotes the true value of the parameter $\theta$ (a small simulation illustrating this is sketched after this list).
  2. The MLE is equivariant: if $\hat{\theta}_n$ is the MLE of $\theta$, then $g(\hat{\theta}_n)$ is the MLE of $g(\theta)$.
  3. The MLE is asymptotically Normal: $(\hat{\theta}_n - \theta_\star)/\widehat{\mathrm{se}} \rightsquigarrow N(0, 1)$; also, the estimated standard error $\widehat{\mathrm{se}}$ can often be computed analytically.
  4. The MLE is asymptotically optimal: roughly, this means that among all well-behaved estimators, the MLE has the smallest variance, at least for large samples.
  5. The MLE is approximately the Bayes estimator.
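The consistency property can be illustrated by simulation. This is a sketch under assumed settings (a Bernoulli model with a made-up true parameter): the MLE $\hat{p}_n = \bar{X}_n$ gets closer to the true value as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.3                                # assumed true parameter

for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.binomial(1, p_true, size=n)     # Bernoulli(p_true) sample of size n
    p_hat = x.mean()                        # MLE of p in the Bernoulli model
    print(n, p_hat, abs(p_hat - p_true))    # the error shrinks as n grows
```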

Consistency of Maximum Likelihood Estimators

Consistency means that the MLE converges in probability to the true value. If $f$ and $g$ are PDFs, define the Kullback-Leibler distance between $f$ and $g$ to be

$$D(f, g) = \int f(x) \log\left(\frac{f(x)}{g(x)}\right)\, dx.$$

For any $\theta, \psi \in \Theta$, write $D(\theta, \psi)$ to mean $D(f(x; \theta), f(x; \psi))$.
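As a quick illustration (an assumed example, not part of the text): for two unit-variance Normal densities the Kullback-Leibler distance works out to $(\mu_1 - \mu_2)^2/2$, and the snippet below checks the defining integral numerically.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu = 1.5   # one density is N(0, 1), the other is N(mu, 1)

# D(f, g) = ∫ f(x) log(f(x) / g(x)) dx, computed via log-densities for stability
integrand = lambda x: norm.pdf(x) * (norm.logpdf(x) - norm.logpdf(x, loc=mu))
kl, _ = quad(integrand, -np.inf, np.inf)

print(kl, mu**2 / 2)   # both ≈ 1.125
```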

Definition

A model is identifiable if $\theta \neq \psi$ implies that $D(\theta, \psi) > 0$. This means that different values of the parameter correspond to different distributions.

Assume that the model is identifiable. Let $\theta_\star$ denote the true value of $\theta$. Maximizing $\ell_n(\theta)$ is equivalent to maximizing

$$M_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{f(X_i; \theta)}{f(X_i; \theta_\star)}.$$

This follows since $M_n(\theta) = \frac{1}{n}\ell_n(\theta) - \frac{1}{n}\sum_{i=1}^{n} \log f(X_i; \theta_\star)$ and the second term is a constant (with respect to $\theta$). For each fixed $\theta$, the randomness is in the sample $X_1, \ldots, X_n$, and since the data are assumed to be drawn from the true model $f(x; \theta_\star)$, the expectation below is taken with respect to that distribution. By the Law of Large Numbers, $M_n(\theta)$ converges to

$$\mathbb{E}_{\theta_\star}\left[\log \frac{f(X; \theta)}{f(X; \theta_\star)}\right] = \int \log\left(\frac{f(x; \theta)}{f(x; \theta_\star)}\right) f(x; \theta_\star)\, dx = -D(\theta_\star, \theta).$$

Hence, $M_n(\theta) \approx -D(\theta_\star, \theta)$, which is maximized at $\theta_\star$ since $-D(\theta_\star, \theta_\star) = 0$ and $-D(\theta_\star, \theta) < 0$ for $\theta \neq \theta_\star$. This suggests that the maximizer of $M_n(\theta)$, and hence the MLE, should converge to $\theta_\star$, with a fully rigorous proof requiring additional regularity conditions.
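This argument can be checked numerically. Below is a sketch under assumed settings (a $N(\theta, 1)$ model with $\theta_\star = 0$ and simulated data): $M_n(\theta)$ is evaluated on a grid and compared with its limit $-D(\theta_\star, \theta) = -\theta^2/2$, whose maximum is at $\theta_\star$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
theta_star = 0.0
x = rng.normal(theta_star, 1, size=5_000)    # sample from the true model N(0, 1)

thetas = np.linspace(-2, 2, 81)
# M_n(θ) = (1/n) Σ_i log[f(X_i; θ) / f(X_i; θ*)]
M_n = np.array([np.mean(norm.logpdf(x, loc=t) - norm.logpdf(x, loc=theta_star))
                for t in thetas])

limit = -thetas**2 / 2                       # -D(θ*, θ) for the N(θ, 1) model
print(thetas[np.argmax(M_n)])                # the maximizer is close to θ* = 0
print(np.max(np.abs(M_n - limit)))           # M_n is uniformly close to its limit
```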

Asymptotic Normality

It turns out that the distribution of $\hat{\theta}_n$ is approximately Normal and we can compute its approximate variance analytically.

Definition

The score function is defined to be

$$s(X; \theta) = \frac{\partial \log f(X; \theta)}{\partial \theta}.$$

The Fisher information is defined to be

$$I_n(\theta) = \mathbb{V}_\theta\left(\sum_{i=1}^{n} s(X_i; \theta)\right) = \sum_{i=1}^{n} \mathbb{V}_\theta\big(s(X_i; \theta)\big),$$

where the second equality is due to the variance of a sum of independent random variables being the sum of their individual variances.
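A small Monte Carlo sketch (assuming a Bernoulli model with a made-up parameter, not an example from the text): the score of a single observation has mean approximately 0, and its variance approximately matches the known single-observation Fisher information $I(p) = 1/(p(1-p))$.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.3
x = rng.binomial(1, p, size=1_000_000)    # many independent draws of one observation

# Score of one observation: s(X; p) = d/dp [X log p + (1 - X) log(1 - p)]
score = x / p - (1 - x) / (1 - p)

print(score.mean())                       # ≈ 0
print(score.var(), 1 / (p * (1 - p)))     # both ≈ 4.76
```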

Intuition

The Fisher information measures how much information the data carry about the parameter $\theta$. The score is the derivative of the log-density, so it measures how sensitive the model is to changes in $\theta$ for a single observation. If changing $\theta$ tends to change the log-density a lot, then the score will typically have larger magnitude, so its variance will be larger. Thus, the definition says that the Fisher information is large when the model responds strongly to changes in $\theta$, and small when nearby values of $\theta$ are hard to distinguish. In this sense, larger Fisher information means $\theta$ can be estimated more precisely.

It can be shown that $\mathbb{E}_\theta[s(X; \theta)] = 0$. It then follows that $\mathbb{V}_\theta(s(X; \theta)) = \mathbb{E}_\theta[s^2(X; \theta)]$. A further simplification of $I_n(\theta)$ is given in the next result.

Theorem

$I_n(\theta) = n I(\theta)$, where $I(\theta) = \mathbb{V}_\theta(s(X; \theta))$. Also,

$$I(\theta) = \mathbb{E}_\theta\left[-\frac{\partial^2 \log f(X; \theta)}{\partial \theta^2}\right] = -\int \left(\frac{\partial^2 \log f(x; \theta)}{\partial \theta^2}\right) f(x; \theta)\, dx.$$

Intuition

This formula gives the same quantity a second interpretation in terms of curvature. The second derivative measures how curved the log-likelihood is as a function of $\theta$. Near a maximum this quantity is usually negative, so the minus sign makes the information positive. If the log-likelihood is sharply curved near the true value, then moving $\theta$ slightly makes the fit much worse, so the data determine $\theta$ more precisely and the Fisher information is large. If it is relatively flat, then nearby values of $\theta$ fit almost equally well, so the Fisher information is small. Therefore, the variance-of-score definition and the negative-expected-curvature definition are two equivalent ways of measuring the same idea: how sensitive the model is to changes in $\theta$.
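To see the equivalence concretely, here is an exact check under an assumed Poisson($\lambda$) model: both the variance-of-score form and the negative-expected-curvature form give $I(\lambda) = 1/\lambda$. Truncating the support is only a numerical convenience.

```python
import numpy as np
from scipy.stats import poisson

lam = 3.0
k = np.arange(0, 200)                # truncated support; the neglected tail mass is negligible
pmf = poisson.pmf(k, lam)

# For the Poisson: log f(k; λ) = k log λ - λ - log(k!)
score = k / lam - 1.0                # first derivative of log f(k; λ) in λ
curvature = -k / lam**2              # second derivative of log f(k; λ) in λ

var_score = np.sum(pmf * score**2)          # E[s²] (the score has mean 0)
neg_exp_curv = -np.sum(pmf * curvature)     # -E[second derivative of log f]
print(var_score, neg_exp_curv, 1 / lam)     # all ≈ 0.3333
```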

Theorem (Asymptotic Normality of the MLE)

Let $\mathrm{se} = \sqrt{\mathbb{V}(\hat{\theta}_n)}$. Under appropriate regularity conditions, the following hold:

  1. $\mathrm{se} \approx \sqrt{1 / I_n(\theta_\star)}$ and $\dfrac{\hat{\theta}_n - \theta_\star}{\mathrm{se}} \rightsquigarrow N(0, 1)$.
  2. Let $\widehat{\mathrm{se}} = \sqrt{1 / I_n(\hat{\theta}_n)}$. Then, $\dfrac{\hat{\theta}_n - \theta_\star}{\widehat{\mathrm{se}}} \rightsquigarrow N(0, 1)$.
  • The first statement says that $\hat{\theta}_n \approx N(\theta_\star, \mathrm{se}^2)$, where the approximate standard error of $\hat{\theta}_n$ is $\mathrm{se} = \sqrt{1 / I_n(\theta_\star)}$.
  • The second statement says that this is still true even if we replace the standard error by its estimated standard error $\widehat{\mathrm{se}} = \sqrt{1 / I_n(\hat{\theta}_n)}$.

Informally, the theorem says that the distribution of the MLE can be approximated with $N(\theta_\star, \widehat{\mathrm{se}}^{\,2})$.
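A simulation sketch of the theorem under assumed settings (an Exponential($\lambda$) model with a made-up true rate): across many replications, the standardized quantity $(\hat{\lambda}_n - \lambda_\star)/\widehat{\mathrm{se}}$ behaves approximately like a standard Normal. For this model $I_n(\lambda) = n/\lambda^2$, so $\widehat{\mathrm{se}} = \hat{\lambda}_n/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(4)
lam_star, n, reps = 2.5, 500, 20_000

x = rng.exponential(scale=1 / lam_star, size=(reps, n))
lam_hat = 1 / x.mean(axis=1)              # MLE of the rate in each replication
se_hat = lam_hat / np.sqrt(n)             # sqrt(1 / I_n(λ̂)), since I_n(λ) = n / λ²
z = (lam_hat - lam_star) / se_hat

print(z.mean(), z.std())                  # ≈ 0 and ≈ 1
print(np.mean(np.abs(z) < 1.96))          # ≈ 0.95, as for a standard Normal
```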

Theorem

Let

$$C_n = \left(\hat{\theta}_n - z_{\alpha/2}\, \widehat{\mathrm{se}}, \;\; \hat{\theta}_n + z_{\alpha/2}\, \widehat{\mathrm{se}}\right),$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard Normal distribution. Then $\mathbb{P}_{\theta_\star}(\theta_\star \in C_n) \to 1 - \alpha$ as $n \to \infty$.
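A sketch of the resulting interval under an assumed Bernoulli model: since $I_n(p) = n/(p(1-p))$, the interval is $\hat{p}_n \pm z_{\alpha/2}\sqrt{\hat{p}_n(1-\hat{p}_n)/n}$, and its empirical coverage over repeated simulated samples is close to the nominal 95%.

```python
import numpy as np

rng = np.random.default_rng(5)
p_star, n, reps, z = 0.3, 400, 20_000, 1.96     # z is the 0.025 upper Normal quantile

x = rng.binomial(1, p_star, size=(reps, n))
p_hat = x.mean(axis=1)                          # MLE in each replication
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)       # sqrt(1 / I_n(p̂))

lo, hi = p_hat - z * se_hat, p_hat + z * se_hat
print(np.mean((lo < p_star) & (p_star < hi)))   # empirical coverage ≈ 0.95
```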

Optimality

Suppose that $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$. The MLE is the sample mean $\hat{\theta}_n = \bar{X}_n$. Another reasonable estimator of $\theta$ is the sample median $\tilde{\theta}_n$. The MLE satisfies

$$\sqrt{n}\,(\hat{\theta}_n - \theta) \rightsquigarrow N(0, \sigma^2).$$

It can be proved that the median satisfies

$$\sqrt{n}\,(\tilde{\theta}_n - \theta) \rightsquigarrow N\!\left(0, \frac{\pi}{2}\sigma^2\right).$$

This means that the median converges to the right value but has a larger variance than the MLE.

More generally, consider two estimators $T_n$ and $U_n$ and suppose that

$$\sqrt{n}\,(T_n - \theta) \rightsquigarrow N(0, t^2)$$

and that

$$\sqrt{n}\,(U_n - \theta) \rightsquigarrow N(0, u^2).$$

We define the asymptotic relative efficiency of $U$ to $T$ by $\mathrm{ARE}(U, T) = t^2 / u^2$. In the Normal example, $\mathrm{ARE}(\tilde{\theta}_n, \hat{\theta}_n) = 2/\pi \approx 0.63$. The interpretation is that if you use the median, you are effectively using only a fraction of the data.
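The $2/\pi$ figure can be checked by simulation. Below is a sketch under assumed settings (Normal data with a made-up mean and variance): the ratio of the sampling variance of the mean to that of the median comes out near $2/\pi \approx 0.64$.

```python
import numpy as np

rng = np.random.default_rng(6)
theta, sigma, n, reps = 5.0, 2.0, 200, 20_000   # assumed N(θ, σ²) setting

x = rng.normal(theta, sigma, size=(reps, n))
means = x.mean(axis=1)                  # MLE of θ
medians = np.median(x, axis=1)          # the competing estimator

print(means.var() / medians.var(), 2 / np.pi)   # both ≈ 0.64
```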

Theorem

If $\hat{\theta}_n$ is the MLE and $\tilde{\theta}_n$ is any other estimator, then

$$\mathrm{ARE}(\tilde{\theta}_n, \hat{\theta}_n) \le 1.$$

Interpretation

Thus, the MLE has the smallest (asymptotic) variance and we say that the MLE is efficient or asymptotically optimal.