Maximum Likelihood

The most common method for estimating parameters in a parametric model is the maximum likelihood method.

Definition (Likelihood function)

Let $X_1, \ldots, X_n$ be IID with PDF $f(x; \theta)$. The likelihood function is defined by

$$\mathcal{L}_n(\theta) = \prod_{i=1}^{n} f(X_i; \theta).$$

The log-likelihood function is defined by

$$\ell_n(\theta) = \log \mathcal{L}_n(\theta) = \sum_{i=1}^{n} \log f(X_i; \theta).$$
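For concreteness, here is a minimal sketch (not from the text above) that evaluates the likelihood and log-likelihood of a Bernoulli model on a grid of parameter values; the sample and the grid are made-up illustrative choices.

```python
import numpy as np

# Hypothetical Bernoulli(p) data; f(x; p) = p**x * (1 - p)**(1 - x)
x = np.array([1, 0, 1, 1, 0, 1, 1, 1])

p_grid = np.linspace(0.01, 0.99, 99)

# Likelihood: product over observations of f(X_i; p)
lik = np.array([np.prod(p**x * (1 - p)**(1 - x)) for p in p_grid])

# Log-likelihood: sum over observations of log f(X_i; p)
loglik = np.array([np.sum(x * np.log(p) + (1 - x) * np.log(1 - p)) for p in p_grid])

# Both are maximized at the same point (here the sample mean, 0.75)
print(p_grid[np.argmax(lik)], p_grid[np.argmax(loglik)])
```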

Intuition

The likelihood function is just the joint density of the data, except that we treat it as a function of the parameter $\theta$. Thus, $\mathcal{L}_n : \Theta \to [0, \infty)$. The likelihood function is not a density function in general: it is not true that $\mathcal{L}_n(\theta)$ integrates to 1 (with respect to $\theta$).
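To make the last point concrete, the following numerical check (reusing the same made-up Bernoulli sample as above) integrates $\mathcal{L}_n(p)$ over $p$ and gets a value far from 1; for $s$ successes in $n$ trials the exact value is the Beta function $B(s+1, n-s+1)$.

```python
import numpy as np

x = np.array([1, 0, 1, 1, 0, 1, 1, 1])          # hypothetical Bernoulli sample
p = np.linspace(1e-6, 1 - 1e-6, 100_001)         # grid over the parameter space (0, 1)

lik = p**x.sum() * (1 - p)**(len(x) - x.sum())   # L(p) for this sample

# Riemann-sum approximation of the integral of L(p) over (0, 1)
print(np.sum(lik) * (p[1] - p[0]))               # ≈ 0.004 = B(7, 3), nowhere near 1
```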

Definition (MLE)

The maximum likelihood estimator (MLE), denoted by $\hat{\theta}_n$, is the value of $\theta$ that maximizes $\mathcal{L}_n(\theta)$.

Since $\log$ is a monotonic function, the maximum of $\ell_n(\theta)$ occurs at the same place as the maximum of $\mathcal{L}_n(\theta)$, so maximizing the log-likelihood leads to the same answer as maximizing the likelihood. Often, it is easier to work with the log-likelihood.
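As a minimal sketch of maximizing the log-likelihood in practice (assuming an Exponential($\lambda$) model and simulated data, neither of which comes from the text), the numerical maximizer agrees with the closed-form MLE $\hat{\lambda}_n = 1/\bar{X}_n$.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.exponential(scale=1 / 2.5, size=500)     # simulated data with true rate λ = 2.5

# Log-likelihood of Exponential(λ): sum over i of (log λ - λ X_i)
def negloglik(lam):
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(negloglik, bounds=(1e-6, 50), method="bounded")
print(res.x, 1 / x.mean())                       # numerical maximizer ≈ closed-form MLE
```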

Properties of Maximum Likelihood Estimators

Under certain regularity conditions on the model, the maximum likelihood estimator possesses many properties that make it an appealing choice of estimator. The main properties of the MLE are:

  1. The MLE is consistent: $\hat{\theta}_n \xrightarrow{P} \theta_\star$, where $\theta_\star$ denotes the true value of the parameter $\theta$ (a small simulation illustrating this is sketched after this list).
  2. The MLE is equivariant: if $\hat{\theta}_n$ is the MLE of $\theta$, then $g(\hat{\theta}_n)$ is the MLE of $g(\theta)$.
  3. The MLE is asymptotically Normal: $(\hat{\theta}_n - \theta_\star)/\widehat{\mathrm{se}} \rightsquigarrow N(0, 1)$; also, the estimated standard error $\widehat{\mathrm{se}}$ can often be computed analytically.
  4. The MLE is asymptotically optimal: roughly, this means that among all well-behaved estimators, the MLE has the smallest variance, at least for large samples.
  5. The MLE is approximately the Bayes estimator.
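The consistency property can be illustrated by simulation. This is a sketch under assumed settings (a Bernoulli model with a made-up true parameter): the MLE $\hat{p}_n = \bar{X}_n$ gets closer to the true value as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)
p_true = 0.3                                # assumed true parameter

for n in [10, 100, 1_000, 10_000, 100_000]:
    x = rng.binomial(1, p_true, size=n)     # Bernoulli(p_true) sample of size n
    p_hat = x.mean()                        # MLE of p in the Bernoulli model
    print(n, p_hat, abs(p_hat - p_true))    # the error shrinks as n grows
```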

Consistency of Maximum Likelihood Estimators

Consistency means that the MLE converges in probability to the true value. If $f$ and $g$ are PDFs, define the Kullback-Leibler distance between $f$ and $g$ to be

$$D(f, g) = \int f(x) \log\left(\frac{f(x)}{g(x)}\right)\, dx.$$

For any $\theta, \psi \in \Theta$, write $D(\theta, \psi)$ to mean $D(f(x; \theta), f(x; \psi))$.
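As a quick illustration (an assumed example, not part of the text): for two unit-variance Normal densities the Kullback-Leibler distance works out to $(\mu_1 - \mu_2)^2/2$, and the snippet below checks the defining integral numerically.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

mu = 1.5   # one density is N(0, 1), the other is N(mu, 1)

# D(f, g) = ∫ f(x) log(f(x) / g(x)) dx, computed via log-densities for stability
integrand = lambda x: norm.pdf(x) * (norm.logpdf(x) - norm.logpdf(x, loc=mu))
kl, _ = quad(integrand, -np.inf, np.inf)

print(kl, mu**2 / 2)   # both ≈ 1.125
```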

Definition

A model is identifiable if $\theta \neq \psi$ implies that $D(\theta, \psi) > 0$. This means that different values of the parameter correspond to different distributions.

Assume that the model is identifiable. Let $\theta_\star$ denote the true value of $\theta$. Maximizing $\ell_n(\theta)$ is equivalent to maximizing

$$M_n(\theta) = \frac{1}{n} \sum_{i=1}^{n} \log \frac{f(X_i; \theta)}{f(X_i; \theta_\star)}.$$

This follows since $M_n(\theta) = \frac{1}{n}\ell_n(\theta) - \frac{1}{n}\sum_{i=1}^{n} \log f(X_i; \theta_\star)$ and the second term is a constant (with respect to $\theta$). For each fixed $\theta$, the randomness is in the sample $X_1, \ldots, X_n$, and since the data are assumed to be drawn from the true model $f(x; \theta_\star)$, the expectation below is taken with respect to that distribution. By the Law of Large Numbers, $M_n(\theta)$ converges to

$$\mathbb{E}_{\theta_\star}\left[\log \frac{f(X; \theta)}{f(X; \theta_\star)}\right] = \int \log\left(\frac{f(x; \theta)}{f(x; \theta_\star)}\right) f(x; \theta_\star)\, dx = -D(\theta_\star, \theta).$$

Hence, $M_n(\theta) \approx -D(\theta_\star, \theta)$, which is maximized at $\theta_\star$ since $-D(\theta_\star, \theta_\star) = 0$ and $-D(\theta_\star, \theta) < 0$ for $\theta \neq \theta_\star$. This suggests that the maximizer of $M_n(\theta)$, and hence the MLE, should converge to $\theta_\star$, with a fully rigorous proof requiring additional regularity conditions.
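This argument can be checked numerically. Below is a sketch under assumed settings (a $N(\theta, 1)$ model with $\theta_\star = 0$ and simulated data): $M_n(\theta)$ is evaluated on a grid and compared with its limit $-D(\theta_\star, \theta) = -\theta^2/2$, whose maximum is at $\theta_\star$.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
theta_star = 0.0
x = rng.normal(theta_star, 1, size=5_000)    # sample from the true model N(0, 1)

thetas = np.linspace(-2, 2, 81)
# M_n(θ) = (1/n) Σ_i log[f(X_i; θ) / f(X_i; θ*)]
M_n = np.array([np.mean(norm.logpdf(x, loc=t) - norm.logpdf(x, loc=theta_star))
                for t in thetas])

limit = -thetas**2 / 2                       # -D(θ*, θ) for the N(θ, 1) model
print(thetas[np.argmax(M_n)])                # the maximizer is close to θ* = 0
print(np.max(np.abs(M_n - limit)))           # M_n is uniformly close to its limit
```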

Asymptotic Normality

It turns out that the distribution of $\hat{\theta}_n$ is approximately Normal and we can compute its approximate variance analytically.

Definition

The score function is defined to be

$$s(X; \theta) = \frac{\partial \log f(X; \theta)}{\partial \theta}.$$

The Fisher information is defined to be

$$I_n(\theta) = \mathbb{V}_\theta\left(\sum_{i=1}^{n} s(X_i; \theta)\right) = \sum_{i=1}^{n} \mathbb{V}_\theta\big(s(X_i; \theta)\big),$$

where the second equality is due to the variance of a sum of independent random variables being the sum of their individual variances.
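A small Monte Carlo sketch (assuming a Bernoulli model with a made-up parameter, not an example from the text): the score of a single observation has mean approximately 0, and its variance approximately matches the known single-observation Fisher information $I(p) = 1/(p(1-p))$.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 0.3
x = rng.binomial(1, p, size=1_000_000)    # many independent draws of one observation

# Score of one observation: s(X; p) = d/dp [X log p + (1 - X) log(1 - p)]
score = x / p - (1 - x) / (1 - p)

print(score.mean())                       # ≈ 0
print(score.var(), 1 / (p * (1 - p)))     # both ≈ 4.76
```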

Intuition

The Fisher information measures how much information the data carry about the parameter $\theta$. The score is the derivative of the log-density, so it measures how sensitive the model is to changes in $\theta$ for a single observation. If changing $\theta$ tends to change the log-density a lot, then the score will typically have larger magnitude, so its variance will be larger. Thus, the definition says that the Fisher information is large when the model responds strongly to changes in $\theta$, and small when nearby values of $\theta$ are hard to distinguish. In this sense, larger Fisher information means $\theta$ can be estimated more precisely.

It can be shown that $\mathbb{E}_\theta[s(X; \theta)] = 0$. It then follows that $\mathbb{V}_\theta(s(X; \theta)) = \mathbb{E}_\theta[s^2(X; \theta)]$. A further simplification of $I_n(\theta)$ is given in the next result.

Theorem

$I_n(\theta) = n I(\theta)$, where $I(\theta) = \mathbb{V}_\theta(s(X; \theta))$. Also,

$$I(\theta) = \mathbb{E}_\theta\left[-\frac{\partial^2 \log f(X; \theta)}{\partial \theta^2}\right] = -\int \left(\frac{\partial^2 \log f(x; \theta)}{\partial \theta^2}\right) f(x; \theta)\, dx.$$

Intuition

This formula gives the same quantity a second interpretation in terms of curvature. The second derivative measures how curved the log-likelihood is as a function of $\theta$. Near a maximum this quantity is usually negative, so the minus sign makes the information positive. If the log-likelihood is sharply curved near the true value, then moving $\theta$ slightly makes the fit much worse, so the data determine $\theta$ more precisely and the Fisher information is large. If it is relatively flat, then nearby values of $\theta$ fit almost equally well, so the Fisher information is small. Therefore, the variance-of-score definition and the negative-expected-curvature definition are two equivalent ways of measuring the same idea: how sensitive the model is to changes in $\theta$.
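To see the equivalence concretely, here is an exact check under an assumed Poisson($\lambda$) model: both the variance-of-score form and the negative-expected-curvature form give $I(\lambda) = 1/\lambda$. Truncating the support is only a numerical convenience.

```python
import numpy as np
from scipy.stats import poisson

lam = 3.0
k = np.arange(0, 200)                # truncated support; the neglected tail mass is negligible
pmf = poisson.pmf(k, lam)

# For the Poisson: log f(k; λ) = k log λ - λ - log(k!)
score = k / lam - 1.0                # first derivative of log f(k; λ) in λ
curvature = -k / lam**2              # second derivative of log f(k; λ) in λ

var_score = np.sum(pmf * score**2)          # E[s²] (the score has mean 0)
neg_exp_curv = -np.sum(pmf * curvature)     # -E[second derivative of log f]
print(var_score, neg_exp_curv, 1 / lam)     # all ≈ 0.3333
```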

Theorem (Asymptotic Normality of the MLE)

Let $\mathrm{se} = \sqrt{\mathbb{V}(\hat{\theta}_n)}$. Under appropriate regularity conditions, the following hold:

  1. $\mathrm{se} \approx \sqrt{1 / I_n(\theta_\star)}$ and $\dfrac{\hat{\theta}_n - \theta_\star}{\mathrm{se}} \rightsquigarrow N(0, 1)$.
  2. Let $\widehat{\mathrm{se}} = \sqrt{1 / I_n(\hat{\theta}_n)}$. Then, $\dfrac{\hat{\theta}_n - \theta_\star}{\widehat{\mathrm{se}}} \rightsquigarrow N(0, 1)$.
  • The first statement says that $\hat{\theta}_n \approx N(\theta_\star, \mathrm{se}^2)$, where the approximate standard error of $\hat{\theta}_n$ is $\mathrm{se} = \sqrt{1 / I_n(\theta_\star)}$.
  • The second statement says that this is still true even if we replace the standard error by its estimated standard error $\widehat{\mathrm{se}} = \sqrt{1 / I_n(\hat{\theta}_n)}$.

Informally, the theorem says that the distribution of the MLE can be approximated with $N(\theta_\star, \widehat{\mathrm{se}}^{\,2})$.
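A simulation sketch of the theorem under assumed settings (an Exponential($\lambda$) model with a made-up true rate): across many replications, the standardized quantity $(\hat{\lambda}_n - \lambda_\star)/\widehat{\mathrm{se}}$ behaves approximately like a standard Normal. For this model $I_n(\lambda) = n/\lambda^2$, so $\widehat{\mathrm{se}} = \hat{\lambda}_n/\sqrt{n}$.

```python
import numpy as np

rng = np.random.default_rng(4)
lam_star, n, reps = 2.5, 500, 20_000

x = rng.exponential(scale=1 / lam_star, size=(reps, n))
lam_hat = 1 / x.mean(axis=1)              # MLE of the rate in each replication
se_hat = lam_hat / np.sqrt(n)             # sqrt(1 / I_n(λ̂)), since I_n(λ) = n / λ²
z = (lam_hat - lam_star) / se_hat

print(z.mean(), z.std())                  # ≈ 0 and ≈ 1
print(np.mean(np.abs(z) < 1.96))          # ≈ 0.95, as for a standard Normal
```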

Theorem

Let

$$C_n = \left(\hat{\theta}_n - z_{\alpha/2}\, \widehat{\mathrm{se}}, \;\; \hat{\theta}_n + z_{\alpha/2}\, \widehat{\mathrm{se}}\right),$$

where $z_{\alpha/2}$ is the upper $\alpha/2$ quantile of the standard Normal distribution. Then $\mathbb{P}_{\theta_\star}(\theta_\star \in C_n) \to 1 - \alpha$ as $n \to \infty$.
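A sketch of the resulting interval under an assumed Bernoulli model: since $I_n(p) = n/(p(1-p))$, the interval is $\hat{p}_n \pm z_{\alpha/2}\sqrt{\hat{p}_n(1-\hat{p}_n)/n}$, and its empirical coverage over repeated simulated samples is close to the nominal 95%.

```python
import numpy as np

rng = np.random.default_rng(5)
p_star, n, reps, z = 0.3, 400, 20_000, 1.96     # z is the 0.025 upper Normal quantile

x = rng.binomial(1, p_star, size=(reps, n))
p_hat = x.mean(axis=1)                          # MLE in each replication
se_hat = np.sqrt(p_hat * (1 - p_hat) / n)       # sqrt(1 / I_n(p̂))

lo, hi = p_hat - z * se_hat, p_hat + z * se_hat
print(np.mean((lo < p_star) & (p_star < hi)))   # empirical coverage ≈ 0.95
```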

Optimality

Suppose that $X_1, \ldots, X_n \sim N(\theta, \sigma^2)$. The MLE is the sample mean $\hat{\theta}_n = \bar{X}_n$. Another reasonable estimator of $\theta$ is the sample median $\tilde{\theta}_n$. The MLE satisfies

$$\sqrt{n}\,(\hat{\theta}_n - \theta) \rightsquigarrow N(0, \sigma^2).$$

It can be proved that the median satisfies

$$\sqrt{n}\,(\tilde{\theta}_n - \theta) \rightsquigarrow N\!\left(0, \frac{\pi}{2}\sigma^2\right).$$

This means that the median converges to the right value but has a larger variance than the MLE.

More generally, consider two estimators $T_n$ and $U_n$ and suppose that

$$\sqrt{n}\,(T_n - \theta) \rightsquigarrow N(0, t^2)$$

and that

$$\sqrt{n}\,(U_n - \theta) \rightsquigarrow N(0, u^2).$$

We define the asymptotic relative efficiency of $U$ to $T$ by $\mathrm{ARE}(U, T) = t^2 / u^2$. In the Normal example, $\mathrm{ARE}(\tilde{\theta}_n, \hat{\theta}_n) = 2/\pi \approx 0.63$. The interpretation is that if you use the median, you are effectively using only a fraction of the data.
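The $2/\pi$ figure can be checked by simulation. Below is a sketch under assumed settings (Normal data with a made-up mean and variance): the ratio of the sampling variance of the mean to that of the median comes out near $2/\pi \approx 0.64$.

```python
import numpy as np

rng = np.random.default_rng(6)
theta, sigma, n, reps = 5.0, 2.0, 200, 20_000   # assumed N(θ, σ²) setting

x = rng.normal(theta, sigma, size=(reps, n))
means = x.mean(axis=1)                  # MLE of θ
medians = np.median(x, axis=1)          # the competing estimator

print(means.var() / medians.var(), 2 / np.pi)   # both ≈ 0.64
```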

Theorem

If $\hat{\theta}_n$ is the MLE and $\tilde{\theta}_n$ is any other estimator, then

$$\mathrm{ARE}(\tilde{\theta}_n, \hat{\theta}_n) \le 1.$$

Interpretation

Thus, the MLE has the smallest (asymptotic) variance and we say that the MLE is efficient or asymptotically optimal.