Hypothesis Testing
Suppose that we partition the parameter space $\Theta$ into two disjoint sets $\Theta_0$ and $\Theta_1$ and that we wish to test
$$H_0 : \theta \in \Theta_0 \quad \text{versus} \quad H_1 : \theta \in \Theta_1.$$
We call $H_0$ the null hypothesis and $H_1$ the alternative hypothesis.
Definition (Null Hypothesis)
The null hypothesis, often denoted $H_0$, is the claim that the effect being studied does not exist. The null hypothesis can also be described as the hypothesis in which no relationship exists between two sets of data or variables being analyzed.
If the null hypothesis is true, any experimentally observed effect is due to chance alone.
Definition (Alternative Hypothesis)
In contrast with the null hypothesis, an alternative hypothesis (often denoted $H_1$ or $H_a$) is developed, which claims that a relationship does exist between two variables.
Let $X$ be a random variable, and let $\mathcal{X}$ be the range of $X$. We test a hypothesis by finding an appropriate subset of outcomes $R \subset \mathcal{X}$ called the rejection region. If $X \in R$ we reject the null hypothesis; otherwise, we do not reject the null hypothesis.
Usually, the rejection region $R$ is of the form
$$R = \{ x : T(x) > c \}$$
where $T$ is a test statistic and $c$ is a critical value. The problem in hypothesis testing is to find an appropriate test statistic $T$ and an appropriate critical value $c$.
For a very common example of this template, see The Wald Test, where the test statistic is a standardized estimator of the form
$$W = \frac{\hat{\theta} - \theta_0}{\widehat{\mathrm{se}}}.$$
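As a concrete illustration of this template, here is a minimal sketch of a two-sided Wald test in Python, assuming we already have a point estimate and a reliable estimated standard error. The numbers at the bottom are hypothetical, chosen only to show the mechanics.

```python
from statistics import NormalDist

def wald_test(theta_hat, se_hat, theta0=0.0, alpha=0.05):
    """Two-sided Wald test of H0: theta = theta0.

    Computes W = (theta_hat - theta0) / se_hat and rejects when
    |W| exceeds the critical value z_{alpha/2} = Phi^{-1}(1 - alpha/2).
    """
    w = (theta_hat - theta0) / se_hat          # standardized estimator
    z = NormalDist().inv_cdf(1 - alpha / 2)    # critical value z_{alpha/2}
    return abs(w) > z, w

# Hypothetical numbers: estimate 0.31 with standard error 0.12, testing theta0 = 0.
reject, w = wald_test(0.31, 0.12)
```

Here $|W| \approx 2.58$ exceeds the critical value $\approx 1.96$, so the test rejects at level $0.05$.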
Warning
There is a tendency to use hypothesis testing methods even when they are not appropriate. Often, estimation and confidence intervals are better tools. Use hypothesis testing only when you want to test a well-defined hypothesis.
For example, suppose a company compares an old and new website design and wants to know whether the new design improves average time-on-site. A hypothesis test might show a statistically significant increase, but with a large enough sample even a tiny increase can be significant. In that setting, a confidence interval for the size of the increase is often more useful, because the real question is whether the improvement is large enough to matter in practice.
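The website example can be sketched with a Normal-approximation interval for the effect size; the numbers below (a 1.2-second average increase with standard error 0.4) are made up for illustration.

```python
from statistics import NormalDist

def effect_ci(diff, se, alpha=0.05):
    """Normal-approximation confidence interval diff +/- z_{alpha/2} * se."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff - z * se, diff + z * se

# Hypothetical: new design adds 1.2 seconds on average, estimated se = 0.4.
lo, hi = effect_ci(1.2, 0.4)
# The interval reports the plausible *size* of the improvement (roughly
# 0.4 to 2.0 seconds here); whether that is large enough to matter is a
# practical judgment that a bare reject/do-not-reject answer cannot make.
```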
Definition
The power function of a test with rejection region $R$ is defined by
$$\beta(\theta) = \mathbb{P}_\theta(X \in R).$$
The size of a test is defined to be
$$\alpha = \sup_{\theta \in \Theta_0} \beta(\theta).$$
A test is said to have level $\alpha$ if its size is less than or equal to $\alpha$.
Intuition (Size)
The power function $\beta(\theta)$ is the probability that the test rejects when the true parameter is $\theta$. If $\theta \in \Theta_0$, then rejecting means we rejected the null hypothesis even though it was true.
So, within the null hypothesis, $\beta(\theta)$ is the false positive rate at that particular parameter value. The size takes the largest such value over all $\theta \in \Theta_0$. In other words, the size is the worst-case probability of incorrectly rejecting the null when the null is true.
This is why a level $\alpha$ test is valuable: it guarantees that no matter which parameter value in $\Theta_0$ is the truth, the probability of a false rejection is at most $\alpha$.
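This worst-case interpretation can be checked by simulation. The sketch below, with hypothetical parameters, estimates the rejection rate of a one-sided Normal-mean test (reject $H_0 : \mu \leq 0$ when $\sqrt{n}\,\overline{X}/\sigma > z_\alpha$) at the boundary value $\mu = 0$, where the size is attained:

```python
import random
from statistics import NormalDist

def estimate_size(n=25, sigma=1.0, alpha=0.05, trials=20000, seed=0):
    """Monte Carlo estimate of the size of a one-sided Normal-mean test.

    Simulates data at the worst case mu = 0 (the boundary of Theta_0)
    and counts how often the test rejects.
    """
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
        if (n ** 0.5) * xbar / sigma > z_alpha:
            rejections += 1
    return rejections / trials

# At mu = 0 the estimated rejection rate should be close to alpha = 0.05;
# at any mu < 0 it would be strictly smaller.
```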
Types of Hypothesis Tests
Definition (Simple Hypothesis)
A hypothesis of the form $\theta = \theta_0$ is called a simple hypothesis.
Definition (Composite Hypothesis)
A hypothesis of the form $\theta > \theta_0$ or $\theta < \theta_0$ is called a composite hypothesis.
Definition (Two-Sided Test)
A test of the form
$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta \neq \theta_0$$
is called a two-sided test.
Definition (One-Sided Test)
A test of the form
$$H_0 : \theta \leq \theta_0 \quad \text{versus} \quad H_1 : \theta > \theta_0$$
or
$$H_0 : \theta \geq \theta_0 \quad \text{versus} \quad H_1 : \theta < \theta_0$$
is called a one-sided test.
The most common tests are two-sided tests.
Common Testing Methods
Different hypothesis tests fit different data-generating settings. The right test depends on the type of data, the null hypothesis being tested, and how trustworthy the underlying assumptions are.
| Method | Good to use when | Not a good choice when | Main assumptions / limitations |
|---|---|---|---|
| The Wald Test | You have an estimator that is approximately Normal and a reliable estimated standard error, especially in large samples. Common for regression coefficients and scalar parameters. | Sample sizes are small, the estimator is skewed, or the parameter is near the boundary of its parameter space. | Relies on asymptotic Normality and a good standard error estimate. Can be unstable when the Normal approximation is poor. |
| Likelihood Ratio Test | You can write down a likelihood and want to compare a restricted model under $H_0$ against a less restricted model. Especially useful for multi-parameter and nested-model problems. | The model is hard to fit, the MLE does not behave regularly, or the null puts parameters on the boundary. | Uses likelihood-based modeling and usually large-sample approximations via Wilks’ theorem. Regularity conditions can fail in nonstandard problems. |
| Pearson's Chi-Squared Test | You have categorical count data, such as multinomial or contingency-table data, and want to compare observed counts to expected counts under $H_0$. | Expected counts are too small, categories are sparse, or observations are not independent. | Typically justified asymptotically. Works best when expected cell counts are not too small and the data are counts, not arbitrary continuous measurements. |
| The Permutation Test | You want a nonparametric test, especially for small samples, and the null hypothesis implies the labels are exchangeable. Useful when comparing two groups without trusting parametric assumptions. | The observations are dependent, the labels are not exchangeable under the null, or exhaustive resampling is computationally infeasible. | Exact in ideal settings, but depends on exchangeability under $H_0$. In practice it is often approximated by random resampling rather than all permutations. |
These methods are all instances of the same general idea: build a test statistic that should look typical under $H_0$, then reject when the observed value looks too extreme.
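As one illustration from the table, here is a minimal sketch of a two-sample permutation test, using the absolute difference of means as the statistic and random resampling rather than all permutations. The data at the bottom are made up for illustration.

```python
import random

def permutation_test(x, y, trials=10000, seed=0):
    """Approximate two-sample permutation test.

    Under H0 the group labels are exchangeable, so we repeatedly shuffle
    the pooled data, recompute |mean(x) - mean(y)|, and report the
    fraction of shuffles at least as extreme as the observed statistic.
    """
    rng = random.Random(seed)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = list(x) + list(y)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        xs, ys = pooled[:len(x)], pooled[len(x):]
        if abs(sum(xs) / len(xs) - sum(ys) / len(ys)) >= observed:
            extreme += 1
    return extreme / trials  # approximate p-value

# Hypothetical data: two clearly separated groups should give a small p-value.
p = permutation_test([1.1, 2.3, 1.8, 2.0], [4.5, 5.1, 4.8, 5.6])
```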
Example
Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ where $\sigma$ is known. We want to test $H_0 : \mu \leq 0$ versus $H_1 : \mu > 0$. Hence, $\Theta_0 = (-\infty, 0]$ and $\Theta_1 = (0, \infty)$. Consider the test
$$\text{reject } H_0 \text{ if } T > c,$$
where $T = \overline{X}$. The rejection region is
$$R = \{ (x_1, \ldots, x_n) : \overline{x} > c \}.$$
Let $Z$ denote a standard normal random variable. The power function is
$$\beta(\mu) = \mathbb{P}_\mu\left( \overline{X} > c \right) = \mathbb{P}_\mu\left( \frac{\sqrt{n}(\overline{X} - \mu)}{\sigma} > \frac{\sqrt{n}(c - \mu)}{\sigma} \right) = \mathbb{P}\left( Z > \frac{\sqrt{n}(c - \mu)}{\sigma} \right) = 1 - \Phi\left( \frac{\sqrt{n}(c - \mu)}{\sigma} \right).$$
This function is increasing in $\mu$. Hence,
$$\text{size} = \sup_{\mu \leq 0} \beta(\mu) = \beta(0) = 1 - \Phi\left( \frac{\sqrt{n}\, c}{\sigma} \right).$$
For a size $\alpha$ test, we set this equal to $\alpha$ and solve for $c$ to get
$$c = \frac{\sigma\, \Phi^{-1}(1 - \alpha)}{\sqrt{n}}.$$
We reject when $\overline{X} > \sigma\, z_\alpha / \sqrt{n}$. Equivalently, we reject when
$$\frac{\sqrt{n}\, \overline{X}}{\sigma} > z_\alpha,$$
where $z_\alpha = \Phi^{-1}(1 - \alpha)$.
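The derivation can be verified numerically. The following sketch, with hypothetical values $n = 25$, $\sigma = 2$, and $\alpha = 0.05$, computes the power function and confirms that the size (the supremum over $\mu \leq 0$, attained at the boundary $\mu = 0$) equals $\alpha$:

```python
from statistics import NormalDist

# Hypothetical parameters for illustration.
n, sigma, alpha = 25, 2.0, 0.05

z_alpha = NormalDist().inv_cdf(1 - alpha)   # z_alpha = Phi^{-1}(1 - alpha)
c = sigma * z_alpha / n ** 0.5              # critical value for the sample mean

def power(mu):
    """beta(mu) = P_mu(Xbar > c) = 1 - Phi(sqrt(n) * (c - mu) / sigma)."""
    return 1 - NormalDist().cdf(n ** 0.5 * (c - mu) / sigma)

# beta is increasing in mu, so the size over mu <= 0 is attained at mu = 0,
# and by construction beta(0) = 1 - Phi(z_alpha) = alpha.
size = power(0.0)
```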