Maximum Likelihood
The most common method for estimating parameters in a parametric model is the maximum likelihood method.
Definition (Likelihood function)
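Let $X_1, \dots, X_n$ be IID with PDF $f(x; \theta)$. The likelihood function is defined by
$$\mathcal{L}_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta).$$
The log-likelihood function is defined by $\ell_n(\theta) = \log \mathcal{L}_n(\theta)$.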
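As a concrete illustration, the sketch below evaluates the likelihood and the log-likelihood for a Bernoulli($p$) model, where $f(x; p) = p^x (1 - p)^{1 - x}$. The choice of model, the sample `data`, and the function names are assumptions made for this example only.

```python
import math

def likelihood(p, data):
    """L_n(p): product of the Bernoulli densities f(x_i; p) over the sample."""
    value = 1.0
    for x in data:
        value *= p if x == 1 else 1.0 - p
    return value

def log_likelihood(p, data):
    """l_n(p) = log L_n(p), computed as a sum of log-densities (needs 0 < p < 1)."""
    return sum(math.log(p if x == 1 else 1.0 - p) for x in data)

data = [1, 0, 1, 1, 0, 1, 1, 0]  # hypothetical sample: 5 ones, 3 zeros

for p in (0.3, 0.5, 0.625, 0.8):
    print(p, likelihood(p, data), log_likelihood(p, data))
```

Summing log-densities is usually the safer computation in practice, since a product of many small densities can underflow.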
Intuition
The likelihood function is just the joint density of the data, except that we treat it as a function of the parameter $\theta$. Thus, $\mathcal{L}_n : \Theta \to [0, \infty)$. The likelihood function is not a density function in general: it is not true that $\mathcal{L}_n(\theta)$ integrates to 1 (with respect to $\theta$).
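The claim that the likelihood need not integrate to 1 over the parameter can be checked numerically. The sketch below assumes a Bernoulli($p$) model and a hypothetical sample with 5 ones out of 8 observations; the helper names are illustrative.

```python
import math

def likelihood(p, successes, n):
    """Bernoulli likelihood for a sample with the given number of ones."""
    return p ** successes * (1.0 - p) ** (n - successes)

def integrate_over_parameter(successes, n, steps=100_000):
    """Midpoint-rule approximation of the integral of L_n(p) over p in [0, 1]."""
    width = 1.0 / steps
    total = sum(likelihood((k + 0.5) * width, successes, n) for k in range(steps))
    return total * width

# Hypothetical sample: 5 ones out of n = 8 observations.
area = integrate_over_parameter(5, 8)
# Closed form of the same integral (a Beta function): 5! * 3! / 9! = 1/504.
exact = math.factorial(5) * math.factorial(3) / math.factorial(9)
print(area, exact)  # both are far below 1
```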
Definition (MLE)
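The maximum likelihood estimator (MLE), denoted by $\hat\theta_n$, is the value of $\theta$ that maximizes $\mathcal{L}_n(\theta)$.
Since the logarithm is strictly increasing, the maximum of $\ell_n(\theta)$ occurs at the same place as the maximum of $\mathcal{L}_n(\theta)$, so maximizing the log-likelihood gives the same answer as maximizing the likelihood.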
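A minimal sketch of computing an MLE numerically, assuming a Bernoulli($p$) model and a hypothetical sample. It maximizes the log-likelihood over a grid and compares the result with the closed-form maximizer for this model, the sample mean. The grid search is chosen only for transparency; a real application would normally use a proper optimizer or the closed form.

```python
import math

def log_likelihood(p, data):
    """Bernoulli log-likelihood; needs 0 < p < 1."""
    ones = sum(data)
    zeros = len(data) - ones
    return ones * math.log(p) + zeros * math.log(1.0 - p)

data = [1, 0, 1, 1, 0, 1, 1, 0]  # hypothetical sample: 5 ones, 3 zeros

# Crude numerical maximization: evaluate the log-likelihood on a fine grid.
grid = [k / 10_000 for k in range(1, 10_000)]
p_hat_grid = max(grid, key=lambda p: log_likelihood(p, data))

# For this model the maximizer has a closed form: the sample mean.
p_hat_closed = sum(data) / len(data)

print(p_hat_grid, p_hat_closed)
```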
Properties of Maximum Likelihood Estimators
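Under certain conditions on the model, the maximum likelihood estimator $\hat\theta_n$ possesses many properties that make it an appealing choice of estimator:
- The MLE is consistent: $\hat\theta_n \xrightarrow{P} \theta_\star$, where $\theta_\star$ denotes the true value of the parameter $\theta$.
- The MLE is equivariant: if $\hat\theta_n$ is the MLE of $\theta$, then $g(\hat\theta_n)$ is the MLE of $g(\theta)$.
- The MLE is asymptotically Normal: $(\hat\theta_n - \theta_\star)/\widehat{\mathsf{se}} \rightsquigarrow N(0, 1)$; also, the estimated standard error $\widehat{\mathsf{se}}$ can often be computed analytically.
- The MLE is asymptotically optimal: roughly, this means that among all well-behaved estimators, the MLE has the smallest variance, at least for large samples.
- The MLE is approximately the Bayes estimator.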
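Equivariance is the easiest of these properties to check directly. The sketch below assumes a Bernoulli($p$) model and the transformation $g(p) = p/(1 - p)$ (the odds), both chosen for illustration. It compares $g(\hat p)$ with the value obtained by maximizing the likelihood over the odds directly.

```python
import math

def log_likelihood_p(p, ones, zeros):
    """Bernoulli log-likelihood in terms of the success probability p."""
    return ones * math.log(p) + zeros * math.log(1.0 - p)

def log_likelihood_odds(odds, ones, zeros):
    """The same log-likelihood, re-expressed through odds = p / (1 - p)."""
    p = odds / (1.0 + odds)
    return log_likelihood_p(p, ones, zeros)

ones, zeros = 5, 3  # hypothetical sample summary

# MLE of p (closed form for this model), then transformed by g(p) = p / (1 - p).
p_hat = ones / (ones + zeros)
odds_from_p_hat = p_hat / (1.0 - p_hat)

# Direct maximization over the odds on a grid, without going through p_hat.
odds_grid = [k / 1_000 for k in range(1, 10_001)]
odds_hat = max(odds_grid, key=lambda o: log_likelihood_odds(o, ones, zeros))

print(odds_from_p_hat, odds_hat)
```

The two values agree up to the grid resolution.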
Consistency of Maximum Likelihood Estimators
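Consistency means that the MLE converges in probability to the true value. If $f$ and $g$ are PDFs, define the Kullback-Leibler distance between $f$ and $g$ to be
$$D(f, g) = \int f(x) \log\left(\frac{f(x)}{g(x)}\right) dx.$$
It can be shown that $D(f, g) \ge 0$ and $D(f, f) = 0$. For any $\theta, \psi \in \Theta$, write $D(\theta, \psi)$ to mean $D(f(x; \theta), f(x; \psi))$.
Definition
A model $\mathfrak{F} = \{f(x; \theta) : \theta \in \Theta\}$ is identifiable if $\theta \neq \psi$ implies that $D(\theta, \psi) > 0$. This means that different values of the parameter correspond to different distributions.
Assume that the model is identifiable. Let $\theta_\star$ denote the true value of $\theta$. Maximizing $\ell_n(\theta)$ is equivalent to maximizing
$$M_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{f(X_i; \theta)}{f(X_i; \theta_\star)}.$$
This follows since $M_n(\theta) = n^{-1}\left(\ell_n(\theta) - \ell_n(\theta_\star)\right)$ and $\ell_n(\theta_\star)$ is a constant (with respect to $\theta$). By the law of large numbers, $M_n(\theta)$ converges to
$$\mathbb{E}_{\theta_\star}\left(\log \frac{f(X_i; \theta)}{f(X_i; \theta_\star)}\right) = \int \log\left(\frac{f(x; \theta)}{f(x; \theta_\star)}\right) f(x; \theta_\star)\, dx = -D(\theta_\star, \theta).$$
Hence, $M_n(\theta) \approx -D(\theta_\star, \theta)$, which is maximized at $\theta_\star$ since $-D(\theta_\star, \theta_\star) = 0$ and $-D(\theta_\star, \theta) < 0$ for $\theta \neq \theta_\star$. Therefore, we expect that the maximizer will tend to $\theta_\star$.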
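Convergence in probability can be illustrated by simulation. The sketch below assumes a Bernoulli model with true value $p_\star = 0.3$, for which the MLE is the sample mean, and estimates how often the MLE misses the truth by more than $\epsilon = 0.05$ at several sample sizes. The seed, tolerance, sample sizes, and number of replications are arbitrary choices.

```python
import random

def bernoulli_mle(n, p_true, rng):
    """MLE of p from n simulated Bernoulli(p_true) draws (the sample mean)."""
    return sum(rng.random() < p_true for _ in range(n)) / n

def miss_rate(n, p_true, eps, reps, rng):
    """Fraction of replications in which |p_hat - p_true| exceeds eps."""
    misses = sum(abs(bernoulli_mle(n, p_true, rng) - p_true) > eps for _ in range(reps))
    return misses / reps

rng = random.Random(0)  # fixed seed so the run is reproducible
p_true, eps, reps = 0.3, 0.05, 200  # all hypothetical choices
rates = {n: miss_rate(n, p_true, eps, reps, rng) for n in (50, 500, 5000)}
print(rates)  # the miss rate should shrink toward 0 as n grows
```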
Asymptotic Normality
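It turns out that the distribution of $\hat\theta_n$ is approximately Normal, and its approximate variance can be computed analytically. To state this precisely, we first need some definitions.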
Definition
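The score function is defined to be
$$s(X; \theta) = \frac{\partial \log f(X; \theta)}{\partial \theta}.$$
The Fisher information is defined to be
$$I_n(\theta) = \mathbb{V}_\theta\left(\sum_{i=1}^{n} s(X_i; \theta)\right) = \sum_{i=1}^{n} \mathbb{V}_\theta\left(s(X_i; \theta)\right),$$
where the second equality holds because the variance of a sum of independent random variables is the sum of their individual variances.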
Intuition
The Fisher information measures how much information the data carry about the parameter $\theta$. The score is the derivative of the log-density, so it measures how sensitive the model is to changes in $\theta$ for a single observation. If changing $\theta$ tends to change the log-density a lot, then the score will typically have larger magnitude, so its variance will be larger. Thus, the definition says that Fisher information is large when the model responds strongly to changes in $\theta$, and small when nearby values of $\theta$ are hard to distinguish. In this sense, larger Fisher information means $\theta$ can be estimated more precisely.
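To make the variance-of-score definition concrete, the sketch below computes it exactly for a Bernoulli($p$) model by summing over the two possible outcomes. The model and helper names are assumptions for illustration. For this model the result is $1/(p(1 - p))$, which grows as $p$ approaches 0 or 1.

```python
def score(x, p):
    """Derivative of log f(x; p) = x*log(p) + (1 - x)*log(1 - p) with respect to p."""
    return x / p - (1 - x) / (1.0 - p)

def expectation(g, p):
    """Exact expectation of g(X) for X ~ Bernoulli(p): sum over x in {0, 1}."""
    return (1.0 - p) * g(0) + p * g(1)

def fisher_information(p, n=1):
    """I_n(p) = n * Var(score); the variance of a sum of IID terms adds up."""
    mean = expectation(lambda x: score(x, p), p)
    variance = expectation(lambda x: (score(x, p) - mean) ** 2, p)
    return n * variance

for p in (0.5, 0.3, 0.05):
    print(p, fisher_information(p), 1.0 / (p * (1.0 - p)))
```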
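It can be shown that $\mathbb{E}_\theta\left(s(X; \theta)\right) = 0$. It then follows that $\mathbb{V}_\theta\left(s(X; \theta)\right) = \mathbb{E}_\theta\left(s^2(X; \theta)\right)$.
Theorem
$I_n(\theta) = n I(\theta)$, where $I(\theta) = I_1(\theta)$. Also,
$$I(\theta) = -\mathbb{E}_\theta\left(\frac{\partial^2 \log f(X; \theta)}{\partial \theta^2}\right) = -\int \left(\frac{\partial^2 \log f(x; \theta)}{\partial \theta^2}\right) f(x; \theta)\, dx.$$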
Intuition
This formula gives the same quantity a second interpretation in terms of curvature. The second derivative measures how curved the log-likelihood is as a function of $\theta$. Near a maximum this quantity is usually negative, so the minus sign makes the information positive. If the log-likelihood is sharply curved near the true value, then moving $\theta$ slightly makes the fit much worse, so the data determine $\theta$ more precisely and the Fisher information is large. If it is relatively flat, then nearby values of $\theta$ fit almost equally well, so the Fisher information is small. Therefore, the variance-of-score definition and the negative-expected-curvature definition are two equivalent ways of measuring the same idea: how sensitive the model is to changes in $\theta$.
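The agreement between the two forms can be checked numerically. The sketch below assumes a Poisson($\lambda$) model with $\lambda = 2$, chosen for illustration, and computes both the variance of the score and the negative expected second derivative. Both should equal $1/\lambda$. The infinite sum over outcomes is truncated at `x_max`, which is an approximation.

```python
import math

def poisson_pmf(x, lam):
    """Poisson probability mass function f(x; lam)."""
    return math.exp(-lam) * lam ** x / math.factorial(x)

def score(x, lam):
    """d/d(lam) of log f(x; lam) = x*log(lam) - lam - log(x!)."""
    return x / lam - 1.0

def second_derivative(x, lam):
    """d^2/d(lam)^2 of log f(x; lam)."""
    return -x / lam ** 2

def expectation(g, lam, x_max=100):
    """E[g(X)] for X ~ Poisson(lam), truncating the sum at x_max (an approximation)."""
    return sum(g(x) * poisson_pmf(x, lam) for x in range(x_max + 1))

lam = 2.0  # hypothetical parameter value
# The score has mean 0, so its variance is the expectation of its square.
variance_of_score = expectation(lambda x: score(x, lam) ** 2, lam)
negative_expected_curvature = -expectation(lambda x: second_derivative(x, lam), lam)
print(variance_of_score, negative_expected_curvature, 1.0 / lam)
```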
Theorem (Asymptotic Normality of the MLE)
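Let $\mathsf{se} = \sqrt{\mathbb{V}(\hat\theta_n)}$. Under appropriate regularity conditions, the following hold:
- $\mathsf{se} \approx \sqrt{1/I_n(\theta)}$ and $(\hat\theta_n - \theta)/\mathsf{se} \rightsquigarrow N(0, 1)$.
- Let $\widehat{\mathsf{se}} = \sqrt{1/I_n(\hat\theta_n)}$. Then, $(\hat\theta_n - \theta)/\widehat{\mathsf{se}} \rightsquigarrow N(0, 1)$.

The first statement says that $\hat\theta_n \approx N(\theta, \mathsf{se}^2)$, where the approximate standard error of $\hat\theta_n$ is $\mathsf{se} = \sqrt{1/I_n(\theta)}$. The second statement says that this is still true even if we replace the standard error by its estimated standard error $\widehat{\mathsf{se}} = \sqrt{1/I_n(\hat\theta_n)}$.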
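A small simulation can make the second statement tangible. The sketch below assumes an Exponential model with rate $\lambda$, for which the MLE is $1/\overline{X}_n$ and $I_n(\lambda) = n/\lambda^2$, so the estimated standard error is $\hat\lambda/\sqrt{n}$. The true rate, sample size, seed, and number of replications are arbitrary choices.

```python
import math
import random
import statistics

def z_score(n, rate_true, rng):
    """(MLE - truth) / estimated standard error for one simulated Exponential sample."""
    sample = [rng.expovariate(rate_true) for _ in range(n)]
    rate_hat = n / sum(sample)            # MLE of the rate: 1 / sample mean
    se_hat = rate_hat / math.sqrt(n)      # sqrt(1 / I_n(rate_hat)) with I_n(rate) = n / rate^2
    return (rate_hat - rate_true) / se_hat

rng = random.Random(1)  # fixed seed so the run is reproducible
zs = [z_score(400, 2.0, rng) for _ in range(2000)]
inside = sum(abs(z) <= 1.96 for z in zs) / len(zs)

# If the N(0, 1) approximation is good: mean near 0, standard deviation near 1,
# and roughly 95% of the standardized values within +/- 1.96.
print(sum(zs) / len(zs), statistics.stdev(zs), inside)
```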
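Informally, the theorem says that the distribution of the MLE can be approximated with $N(\theta, \widehat{\mathsf{se}}^2)$. From this fact we can construct an (asymptotic) confidence interval.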
Theorem
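Let
$$C_n = \left(\hat\theta_n - z_{\alpha/2}\, \widehat{\mathsf{se}},\ \hat\theta_n + z_{\alpha/2}\, \widehat{\mathsf{se}}\right).$$
Then $\mathbb{P}_\theta(\theta \in C_n) \to 1 - \alpha$ as $n \to \infty$.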
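The interval $\hat\theta_n \pm z_{\alpha/2}\,\widehat{\mathsf{se}}$ is simple to compute once the estimated standard error is known. The sketch below assumes a Bernoulli($p$) model, where $\widehat{\mathsf{se}} = \sqrt{\hat p (1 - \hat p)/n}$, and checks by simulation how often the interval covers the true value. The function name and the simulation settings are hypothetical.

```python
import math
import random
from statistics import NormalDist

def wald_interval(data, alpha=0.05):
    """Asymptotic 1 - alpha interval for a Bernoulli p: p_hat +/- z * se_hat."""
    n = len(data)
    p_hat = sum(data) / n
    se_hat = math.sqrt(p_hat * (1.0 - p_hat) / n)   # sqrt(1 / I_n(p_hat))
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)      # upper alpha/2 quantile of N(0, 1)
    return p_hat - z * se_hat, p_hat + z * se_hat

rng = random.Random(2)  # fixed seed so the run is reproducible
p_true, n, reps = 0.3, 200, 2000  # hypothetical simulation settings
covered = 0
for _ in range(reps):
    data = [1 if rng.random() < p_true else 0 for _ in range(n)]
    low, high = wald_interval(data)
    covered += low < p_true < high
coverage = covered / reps
print(coverage)  # should be close to the nominal 0.95
```

The coverage is only approximately $1 - \alpha$ at finite $n$; the theorem is a statement about the limit.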
Optimality
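Suppose that $X_1, \dots, X_n \sim N(\theta, \sigma^2)$. The MLE is $\hat\theta_n = \overline{X}_n$. Another reasonable estimator of $\theta$ is the sample median $\tilde\theta_n$. The MLE satisfies
$$\sqrt{n}\,(\hat\theta_n - \theta) \rightsquigarrow N(0, \sigma^2).$$
It can be proved that the median satisfies
$$\sqrt{n}\,(\tilde\theta_n - \theta) \rightsquigarrow N\left(0, \sigma^2 \frac{\pi}{2}\right).$$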
This means that the median converges to the right value but has a larger variance than the MLE.
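The comparison between the mean and the median can be checked by simulation. For Normal data the asymptotic variance of $\sqrt{n}$ times the estimation error is $\sigma^2$ for the sample mean and $\sigma^2 \pi / 2$ for the sample median, so the ratio of the two scaled variances should be near $\pi/2 \approx 1.57$. The sample size, number of replications, and seed below are arbitrary choices.

```python
import random
import statistics

def scaled_variances(n, reps, sigma, rng):
    """n * Var(estimator) across replications, for the sample mean and the sample median."""
    means, medians = [], []
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        means.append(sum(sample) / n)
        medians.append(statistics.median(sample))
    return n * statistics.pvariance(means), n * statistics.pvariance(medians)

rng = random.Random(3)  # fixed seed so the run is reproducible
mean_var, median_var = scaled_variances(n=201, reps=2000, sigma=1.0, rng=rng)
print(mean_var, median_var, median_var / mean_var)  # ratio should be near pi / 2
```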
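More generally, consider two estimators $T_n$ and $U_n$, and suppose that
$$\sqrt{n}\,(T_n - \theta) \rightsquigarrow N(0, t^2)$$
and that
$$\sqrt{n}\,(U_n - \theta) \rightsquigarrow N(0, u^2).$$
We define the asymptotic relative efficiency of $U$ to $T$ by $\mathrm{ARE}(U, T) = t^2 / u^2$. In the Normal example, $\mathrm{ARE}(\tilde\theta_n, \hat\theta_n) = 2/\pi \approx 0.63$.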
Theorem
If $\hat\theta_n$ is the MLE and $\tilde\theta_n$ is any other estimator, then
$$\mathrm{ARE}(\tilde\theta_n, \hat\theta_n) \le 1.$$
Interpretation
Thus, the MLE has the smallest (asymptotic) variance and we say that the MLE is efficient or asymptotically optimal.
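As a worked instance, the sketch below computes the asymptotic relative efficiency of the sample median relative to the MLE in the Normal model, taking the ratio of the MLE's asymptotic variance to the median's. The function name and the value of $\sigma^2$ are illustrative assumptions; the result does not depend on $\sigma^2$.

```python
import math

def asymptotic_relative_efficiency(var_other, var_reference):
    """ARE of the 'other' estimator relative to the reference: var_reference / var_other."""
    return var_reference / var_other

sigma2 = 1.0                            # hypothetical variance of the Normal model
var_mle = sigma2                        # asymptotic variance of sqrt(n) * (mean - theta)
var_median = sigma2 * math.pi / 2.0     # asymptotic variance of sqrt(n) * (median - theta)

are = asymptotic_relative_efficiency(var_median, var_mle)
print(are)          # 2 / pi, about 0.64
print(1.0 / are)    # pi / 2, about 1.57
```

One way to read the second number: for large samples, the median needs roughly 57% more observations than the MLE to reach the same variance.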
Sources
- Wasserman, L. (2010). All of Statistics: A Concise Course in Statistical Inference. Chapter 9.