Hypothesis Testing
Suppose that we partition the parameter space $\Theta$ into two disjoint sets $\Theta_0$ and $\Theta_1$ and that we wish to test
$$H_0 : \theta \in \Theta_0 \quad \text{versus} \quad H_1 : \theta \in \Theta_1.$$
We call $H_0$ the null hypothesis and $H_1$ the alternative hypothesis.
Definition (Null Hypothesis)
The null hypothesis, often denoted $H_0$, is the claim that the effect being studied does not exist. The null hypothesis can also be described as the hypothesis in which no relationship exists between two sets of data or variables being analyzed.
If the null hypothesis is true, any experimentally observed effect is due to chance alone.
Definition (Alternative Hypothesis)
In contrast with the null hypothesis, an alternative hypothesis (often denoted $H_1$ or $H_a$) is developed, which claims that a relationship does exist between two variables.
Let $X$ be a random variable, and let $\mathcal{X}$ be the range of $X$. We test a hypothesis by finding an appropriate subset of outcomes $R \subset \mathcal{X}$ called the rejection region. If $X \in R$ we reject the null hypothesis; otherwise, we do not reject the null hypothesis.
Usually, the rejection region $R$ is of the form
$$R = \{ x : T(x) > c \}$$
where $T$ is a test statistic and $c$ is a critical value. The problem in hypothesis testing is to find an appropriate test statistic $T$ and an appropriate critical value $c$.
For a very common example of this template, see The Wald Test, where the test statistic is a standardized estimator of the form
$$W = \frac{\hat{\theta} - \theta_0}{\widehat{\mathrm{se}}}.$$
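As a concrete illustration of this template, here is a minimal sketch of a two-sided Wald test in Python, assuming we already have a point estimate and a reliable estimated standard error. The numbers at the bottom are hypothetical, chosen only to show the mechanics.

```python
from statistics import NormalDist

def wald_test(theta_hat, se_hat, theta0=0.0, alpha=0.05):
    """Two-sided Wald test of H0: theta = theta0.

    Computes W = (theta_hat - theta0) / se_hat and rejects when
    |W| exceeds the critical value z_{alpha/2} = Phi^{-1}(1 - alpha/2).
    """
    w = (theta_hat - theta0) / se_hat          # standardized estimator
    z = NormalDist().inv_cdf(1 - alpha / 2)    # critical value z_{alpha/2}
    return abs(w) > z, w

# Hypothetical numbers: estimate 0.31 with standard error 0.12, testing theta0 = 0.
reject, w = wald_test(0.31, 0.12)
```

Here $|W| \approx 2.58$ exceeds the critical value $\approx 1.96$, so the test rejects at level $0.05$.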
Warning
There is a tendency to use hypothesis testing methods even when they are not appropriate. Often, estimation and confidence intervals are better tools. Use hypothesis testing only when you want to test a well-defined hypothesis.
For example, suppose a company compares an old and new website design and wants to know whether the new design improves average time-on-site. A hypothesis test might show a statistically significant increase, but with a large enough sample even a tiny increase can be significant. In that setting, a confidence interval for the size of the increase is often more useful, because the real question is whether the improvement is large enough to matter in practice.
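The website example can be sketched with a Normal-approximation interval for the effect size; the numbers below (a 1.2-second average increase with standard error 0.4) are made up for illustration.

```python
from statistics import NormalDist

def effect_ci(diff, se, alpha=0.05):
    """Normal-approximation confidence interval diff +/- z_{alpha/2} * se."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return diff - z * se, diff + z * se

# Hypothetical: new design adds 1.2 seconds on average, estimated se = 0.4.
lo, hi = effect_ci(1.2, 0.4)
# The interval reports the plausible *size* of the improvement (roughly
# 0.4 to 2.0 seconds here); whether that is large enough to matter is a
# practical judgment that a bare reject/do-not-reject answer cannot make.
```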
Definition
The power function of a test with rejection region $R$ is defined by
$$\beta(\theta) = \mathbb{P}_\theta(X \in R).$$
The size of a test is defined to be
$$\alpha = \sup_{\theta \in \Theta_0} \beta(\theta).$$
A test is said to have level $\alpha$ if its size is less than or equal to $\alpha$.
Intuition (Size)
The power function $\beta(\theta)$ is the probability that the test rejects when the true parameter is $\theta$. If $\theta \in \Theta_0$, then rejecting means we rejected the null hypothesis even though it was true.
So, within the null hypothesis, $\beta(\theta)$ is the false positive rate at that particular parameter value. The size takes the largest such value over all $\theta \in \Theta_0$. In other words, the size is the worst-case probability of incorrectly rejecting the null when the null is true.
This is why a level $\alpha$ test is valuable: it guarantees that no matter which parameter value in $\Theta_0$ is the truth, the probability of a false rejection is at most $\alpha$.
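This worst-case interpretation can be checked by simulation. The sketch below, with hypothetical parameters, estimates the rejection rate of a one-sided Normal-mean test (reject $H_0 : \mu \leq 0$ when $\sqrt{n}\,\overline{X}/\sigma > z_\alpha$) at the boundary value $\mu = 0$, where the size is attained:

```python
import random
from statistics import NormalDist

def estimate_size(n=25, sigma=1.0, alpha=0.05, trials=20000, seed=0):
    """Monte Carlo estimate of the size of a one-sided Normal-mean test.

    Simulates data at the worst case mu = 0 (the boundary of Theta_0)
    and counts how often the test rejects.
    """
    rng = random.Random(seed)
    z_alpha = NormalDist().inv_cdf(1 - alpha)
    rejections = 0
    for _ in range(trials):
        xbar = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
        if (n ** 0.5) * xbar / sigma > z_alpha:
            rejections += 1
    return rejections / trials

# At mu = 0 the estimated rejection rate should be close to alpha = 0.05;
# at any mu < 0 it would be strictly smaller.
```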
Types of Hypothesis Tests
Definition (Simple Hypothesis)
A hypothesis of the form $\theta = \theta_0$ is called a simple hypothesis.
Definition (Composite Hypothesis)
A hypothesis of the form $\theta > \theta_0$ or $\theta < \theta_0$ is called a composite hypothesis.
Definition (Two-Sided Test)
A test of the form
$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta \neq \theta_0$$
is called a two-sided test.
Definition (One-Sided Test)
A test of the form
$$H_0 : \theta \leq \theta_0 \quad \text{versus} \quad H_1 : \theta > \theta_0$$
or
$$H_0 : \theta \geq \theta_0 \quad \text{versus} \quad H_1 : \theta < \theta_0$$
is called a one-sided test.
The most common tests are two-sided tests.
Common Testing Methods
Different hypothesis tests fit different data-generating settings. The right test depends on the type of data, the null hypothesis being tested, and how trustworthy the underlying assumptions are.
| Method | Good to use when | Not a good choice when | Main assumptions / limitations |
|---|---|---|---|
| The Wald Test | You have an estimator that is approximately Normal and a reliable estimated standard error, especially in large samples. Common for regression coefficients and scalar parameters. | Sample sizes are small, the estimator is skewed, or the parameter is near the boundary of its parameter space. | Relies on asymptotic Normality and a good standard error estimate. Can be unstable when the Normal approximation is poor. |
| Likelihood Ratio Test | You can write down a likelihood and want to compare a restricted model under $H_0$ against a less restricted model. Especially useful for multi-parameter and nested-model problems. | The model is hard to fit, the MLE does not behave regularly, or the null puts parameters on the boundary. | Uses likelihood-based modeling and usually large-sample approximations via Wilks’ theorem. Regularity conditions can fail in nonstandard problems. |
| Pearson's Chi-Squared Test | You have categorical count data, such as multinomial or contingency-table data, and want to compare observed counts to expected counts under $H_0$. | Expected counts are too small, categories are sparse, or observations are not independent. | Typically justified asymptotically. Works best when expected cell counts are not too small and the data are counts, not arbitrary continuous measurements. |
| The Permutation Test | You want a nonparametric test, especially for small samples, and the null hypothesis implies the labels are exchangeable. Useful when comparing two groups without trusting parametric assumptions. | The observations are dependent, the labels are not exchangeable under the null, or exhaustive resampling is computationally infeasible. | Exact in ideal settings, but depends on exchangeability under $H_0$. In practice it is often approximated by random resampling rather than all permutations. |
These methods are all instances of the same general idea: build a test statistic that should look typical under $H_0$, then reject when the observed value looks too extreme.
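As one illustration from the table, here is a minimal sketch of a two-sample permutation test, using the absolute difference of means as the statistic and random resampling rather than all permutations. The data at the bottom are made up for illustration.

```python
import random

def permutation_test(x, y, trials=10000, seed=0):
    """Approximate two-sample permutation test.

    Under H0 the group labels are exchangeable, so we repeatedly shuffle
    the pooled data, recompute |mean(x) - mean(y)|, and report the
    fraction of shuffles at least as extreme as the observed statistic.
    """
    rng = random.Random(seed)
    observed = abs(sum(x) / len(x) - sum(y) / len(y))
    pooled = list(x) + list(y)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)
        xs, ys = pooled[:len(x)], pooled[len(x):]
        if abs(sum(xs) / len(xs) - sum(ys) / len(ys)) >= observed:
            extreme += 1
    return extreme / trials  # approximate p-value

# Hypothetical data: two clearly separated groups should give a small p-value.
p = permutation_test([1.1, 2.3, 1.8, 2.0], [4.5, 5.1, 4.8, 5.6])
```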
Example
Let $X_1, \ldots, X_n \sim N(\mu, \sigma^2)$ where $\sigma$ is known. We want to test $H_0 : \mu \leq 0$ versus $H_1 : \mu > 0$. Hence, $\Theta_0 = (-\infty, 0]$ and $\Theta_1 = (0, \infty)$. Consider the test
$$\text{reject } H_0 \text{ if } T > c,$$
where $T = \overline{X}$. The rejection region is
$$R = \{ (x_1, \ldots, x_n) : \overline{x} > c \}.$$
Let $Z$ denote a standard normal random variable. The power function is
$$\beta(\mu) = \mathbb{P}_\mu\left( \overline{X} > c \right) = \mathbb{P}_\mu\left( \frac{\sqrt{n}(\overline{X} - \mu)}{\sigma} > \frac{\sqrt{n}(c - \mu)}{\sigma} \right) = \mathbb{P}\left( Z > \frac{\sqrt{n}(c - \mu)}{\sigma} \right) = 1 - \Phi\left( \frac{\sqrt{n}(c - \mu)}{\sigma} \right).$$
This function is increasing in $\mu$. Hence,
$$\text{size} = \sup_{\mu \leq 0} \beta(\mu) = \beta(0) = 1 - \Phi\left( \frac{\sqrt{n}\, c}{\sigma} \right).$$
For a size $\alpha$ test, we set this equal to $\alpha$ and solve for $c$ to get
$$c = \frac{\sigma\, \Phi^{-1}(1 - \alpha)}{\sqrt{n}}.$$
We reject when $\overline{X} > \sigma\, z_\alpha / \sqrt{n}$. Equivalently, we reject when
$$\frac{\sqrt{n}\, \overline{X}}{\sigma} > z_\alpha,$$
where $z_\alpha = \Phi^{-1}(1 - \alpha)$.
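The derivation can be verified numerically. The following sketch, with hypothetical values $n = 25$, $\sigma = 2$, and $\alpha = 0.05$, computes the power function and confirms that the size (the supremum over $\mu \leq 0$, attained at the boundary $\mu = 0$) equals $\alpha$:

```python
from statistics import NormalDist

# Hypothetical parameters for illustration.
n, sigma, alpha = 25, 2.0, 0.05

z_alpha = NormalDist().inv_cdf(1 - alpha)   # z_alpha = Phi^{-1}(1 - alpha)
c = sigma * z_alpha / n ** 0.5              # critical value for the sample mean

def power(mu):
    """beta(mu) = P_mu(Xbar > c) = 1 - Phi(sqrt(n) * (c - mu) / sigma)."""
    return 1 - NormalDist().cdf(n ** 0.5 * (c - mu) / sigma)

# beta is increasing in mu, so the size over mu <= 0 is attained at mu = 0,
# and by construction beta(0) = 1 - Phi(z_alpha) = alpha.
size = power(0.0)
```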