p-values
Reporting "reject" or "retain" alone is not very informative. Instead, we could ask, for every $\alpha \in (0, 1)$, whether the test rejects at that level. Generally, if the test rejects at level $\alpha$, it will also reject at any level $\alpha' > \alpha$. Hence, there is a smallest $\alpha$ at which the test rejects, and we call this number the p-value.
This ties the p-value directly to the idea of the size of a test. The size $\alpha$ is chosen in advance as a tolerated false-positive rate, while the p-value is computed after seeing the data. The rule is: reject $H_0$ if and only if the p-value is at most $\alpha$.
Definition (p-value)
Suppose that for every $\alpha \in (0, 1)$ we have a size $\alpha$ test with reject region $R_\alpha$. Then,

$$\text{p-value} = \inf\{\alpha : x^n \in R_\alpha\}.$$

That is, the p-value is the smallest level $\alpha$ at which we can reject $H_0$.
Informally, the p-value is a measure of the evidence against $H_0$: the smaller the p-value, the stronger the evidence against $H_0$.
Intuition
Imagine sliding the significance level from very small to larger values. The p-value is the first point where the observed data becomes extreme enough to enter the rejection region.
So a p-value of $p$ means: the data is just strong enough to reject $H_0$ at every level $\alpha \ge p$, but not at any level $\alpha < p$.
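The sliding picture can be made concrete with a small sketch. The setup below is an assumption for illustration (a right-tailed test whose statistic is $N(0,1)$ under the null, with a hypothetical observed value of 2.1); it scans $\alpha$ from small to large and reports the first level at which the test rejects, alongside the tail area beyond the observed statistic.

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF, via the error function (stdlib only)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_critical(alpha: float) -> float:
    """Critical value c with P(Z >= c) = alpha, found by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if 1.0 - phi(mid) > alpha:
            lo = mid  # tail area still too big: move the cutoff right
        else:
            hi = mid
    return (lo + hi) / 2.0

t_obs = 2.1  # hypothetical observed statistic (illustrative value)

# Slide alpha from small to large and record the first level that rejects.
alphas = [a / 1000.0 for a in range(1, 200)]  # 0.001, 0.002, ..., 0.199
smallest_rejecting = next(a for a in alphas if t_obs >= z_critical(a))

p_value = 1.0 - phi(t_obs)  # tail area beyond the observed statistic
print(smallest_rejecting, round(p_value, 4))
```

Up to the resolution of the $\alpha$ grid, the first rejecting level coincides with the tail probability, which is exactly the definition of the p-value.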
Warning
A large p-value is not strong evidence in favor of $H_0$. A large p-value can occur for two reasons:
- $H_0$ is true, or
- $H_0$ is false but the test has low power.
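The low-power case is easy to simulate. In this sketch (all numbers are illustrative assumptions, not from the text), the null $H_0 : \mu = 0$ is false, yet with only five observations the right-tailed z-test usually produces an unremarkable p-value.

```python
import math
import random

random.seed(2)

def phi(z: float) -> float:
    """Standard normal CDF (stdlib only)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical setup: H0 says mu = 0, but the true mean is 0.2.  With only
# n = 5 observations the right-tailed z-test has low power, so large
# p-values are common even though H0 is false.
n, mu_true, reps = 5, 0.2, 2_000
large_p = 0
for _ in range(reps):
    xbar = sum(random.gauss(mu_true, 1.0) for _ in range(n)) / n
    z = xbar * math.sqrt(n)        # z-statistic for H0: mu = 0, sigma = 1
    if 1.0 - phi(z) > 0.1:         # p-value bigger than 0.1
        large_p += 1
print(large_p / reps)  # most runs give a p-value above 0.1
```

So a large p-value here says nothing in favor of $H_0$; the test simply cannot tell the two hypotheses apart at this sample size.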
Warning
A p-value is not the probability that $H_0$ is true. It is a probability computed assuming $H_0$ is true.
More precisely, it measures how surprising the observed test statistic would be under the null model.
The following result explains how to compute the p-value.
Theorem
Suppose that the size $\alpha$ test is of the form

$$\text{reject } H_0 \iff T(X^n) \ge c_\alpha.$$

Then,

$$\text{p-value} = \sup_{\theta \in \Theta_0} \mathbb{P}_\theta\left(T(X^n) \ge T(x^n)\right),$$

where $X^n$ denotes the random sample and $x^n$ denotes the observed sample. Thus, $T(X^n)$ is the random test statistic and $T(x^n)$ is its observed value. If $\Theta_0 = \{\theta_0\}$, then

$$\text{p-value} = \mathbb{P}_{\theta_0}\left(T(X^n) \ge T(x^n)\right).$$
Intuition
The p-value is the probability (under $H_0$) of observing a value of the test statistic the same as, or more extreme than, what was actually observed.
In other words: if the null hypothesis were true, how surprising would this data look?
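This recipe can be sketched directly. Assume, for illustration only, that the statistic is $N(0,1)$ under $H_0$ and the test is right-tailed; the tail probability $\mathbb{P}_{\theta_0}(T(X^n) \ge T(x^n))$ is then estimated by drawing the statistic repeatedly under the null model (all variable names here are hypothetical).

```python
import math
import random

random.seed(0)

def phi(z: float) -> float:
    """Standard normal CDF (stdlib only)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Assumption for illustration: the statistic T(X^n) is N(0, 1) under H0
# and the test is right-tailed.  Suppose we observed T(x^n) = 1.8.
t_obs = 1.8

# Monte Carlo estimate of P_{theta_0}(T(X^n) >= T(x^n)): draw the statistic
# many times under the null model and count how often it reaches t_obs.
draws = [random.gauss(0.0, 1.0) for _ in range(200_000)]
p_mc = sum(t >= t_obs for t in draws) / len(draws)

p_exact = 1.0 - phi(t_obs)  # exact tail area under the N(0, 1) null
print(round(p_mc, 3), round(p_exact, 3))
```

The simulated tail frequency and the exact tail area agree to a few decimal places, which is all the theorem asserts: the p-value is a null tail probability evaluated at the observed statistic.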
Connection to the Reject Region
In hypothesis testing, a test rejects when the statistic falls in a reject region. The p-value repackages that same information into a single number.
- The reject region answers: at a fixed level $\alpha$, do we reject or not?
- The p-value answers: how far into the reject region did the observed statistic go?
So the p-value is not a different idea from hypothesis testing. It is a more informative summary of the same test.
Why Smaller Means Stronger Evidence
If the observed test statistic is very extreme under $H_0$, then the tail probability

$$\mathbb{P}_{\theta_0}\left(T(X^n) \ge T(x^n)\right)$$

will be small. That means the observed data would be unusual if the null were true, so we view the data as evidence against $H_0$.
If the p-value is not small, then the data is not especially surprising under $H_0$. But this only means the data is compatible with $H_0$; it does not prove that $H_0$ is correct.
Geometric Interpretation
There is a useful geometric way to think about both the size and the p-value.
Consider the sampling distribution of the test statistic under . The x-axis represents possible values of the test statistic, and the area under the curve represents probability.
- The size $\alpha$ is the area of the reject region under the null distribution.
- The p-value is the area under the null distribution corresponding to values at least as extreme as the observed test statistic.
For a right-tailed test, the reject region has the form

$$R_\alpha = \{x^n : T(x^n) \ge c_\alpha\},$$

where $c_\alpha$ is chosen so that

$$\mathbb{P}_{\theta_0}\left(T(X^n) \ge c_\alpha\right) = \alpha.$$

Geometrically, $\alpha$ is the area to the right of the critical value $c_\alpha$.
After observing the data, suppose the test statistic takes the value $t = T(x^n)$. Then the p-value is

$$\mathbb{P}_{\theta_0}\left(T(X^n) \ge t\right),$$

which is the area to the right of the observed value.
So, geometrically:
- $\alpha$ sets a cutoff in advance
- the p-value measures the tail area beyond the observed statistic
For a two-sided test, extremeness is measured in both tails. In that case, the p-value is the total area in both tails beyond the observed magnitude of the test statistic.
This geometric picture explains why

$$\text{p-value} \le \alpha \iff \text{reject } H_0.$$

If the tail area beyond the observed statistic is smaller than the pre-chosen tail area $\alpha$, then the observed statistic lies in the reject region.
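This equivalence can be checked numerically. The sketch below (assuming an $N(0,1)$ null for a right-tailed test, stdlib only) computes $c_\alpha$ by bisection and verifies that the p-value is at most $\alpha$ exactly when the statistic clears the critical value.

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF (stdlib only)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def upper_critical(alpha: float) -> float:
    """c_alpha with P(Z >= c_alpha) = alpha, by bisection (N(0,1) null)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if 1.0 - phi(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# For a right-tailed test with an N(0, 1) null, check numerically that
#   p-value <= alpha   if and only if   t_obs >= c_alpha.
for alpha in (0.01, 0.05, 0.10):
    c = upper_critical(alpha)
    for t_obs in (-1.0, 0.5, 1.5, 2.0, 3.0):
        p = 1.0 - phi(t_obs)  # tail area beyond the observed statistic
        assert (p <= alpha) == (t_obs >= c)
print("equivalence holds on the grid")
```

The two comparisons are the same event stated in different coordinates: one compares areas, the other compares positions on the axis.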
Example
Suppose we are testing

$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta \ne \theta_0$$

using the Wald statistic

$$W = \frac{\hat{\theta} - \theta_0}{\widehat{\mathrm{se}}}.$$

Under $H_0$, we have approximately $W \approx N(0, 1)$. For a two-sided test, the p-value is therefore

$$\text{p-value} = \mathbb{P}\left(|Z| > |w|\right) = 2\Phi(-|w|),$$

where $Z \sim N(0, 1)$ and $w$ is the observed value of $W$.
For example, if the observed test statistic is $w = 2.1$, then

$$\text{p-value} = 2\Phi(-2.1) \approx 0.036.$$

So:
- we reject $H_0$ at level $\alpha = 0.05$
- we reject $H_0$ at level $\alpha = 0.04$
- we do not reject $H_0$ at level $\alpha = 0.01$
This shows both interpretations at once:
- as a threshold rule, the p-value is the smallest level $\alpha$ at which the test rejects
- as a tail probability, the p-value is the probability under $H_0$ of seeing a test statistic at least this extreme
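The Wald computation is short to code. A minimal sketch (the input numbers are illustrative assumptions, not data from the text), using the identity $2\Phi(-|w|) = \operatorname{erfc}(|w|/\sqrt{2})$ so that only the standard library is needed:

```python
import math

def wald_p_value(theta_hat: float, theta_0: float, se_hat: float):
    """Two-sided Wald p-value: w = (theta_hat - theta_0) / se_hat and
    p = P(|Z| >= |w|) = 2 * Phi(-|w|) = erfc(|w| / sqrt(2))."""
    w = (theta_hat - theta_0) / se_hat
    return w, math.erfc(abs(w) / math.sqrt(2.0))

# Illustrative inputs (hypothetical): these give an observed w of 2.1.
w, p = wald_p_value(0.521, 0.5, 0.01)
print(round(w, 2), round(p, 3))
```

With $w = 2.1$ the p-value comes out near $0.036$, so this test would reject at level $0.05$ but not at level $0.01$.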
p-hacking
In principled hypothesis testing, the testing procedure is chosen before looking at the data. This includes:
- the null and alternative hypotheses
- the test statistic
- the significance level $\alpha$
- the stopping rule and data-cleaning rules
After that, the data is collected, the p-value is computed, and a decision is made about whether to reject.
Intuition (p-hacking)
p-hacking is what happens when the analysis is adjusted after looking at the data in order to make the p-value small enough to cross the chosen threshold.
In other words, instead of asking "if $H_0$ were true, how surprising is this data under the pre-chosen test?", the procedure is repeatedly altered until the data looks sufficiently surprising.
Common forms of p-hacking include:
- trying many outcomes and reporting only the one with the smallest p-value
- stopping data collection as soon as the p-value drops below $\alpha$
- trying several model specifications and only reporting the significant one
- removing “outliers” only when doing so helps produce significance
- switching between one-sided and two-sided tests after seeing the data
Why p-hacking is a problem
The size $\alpha$ only controls the false-positive rate for a fixed testing procedure. Once we try many procedures and only report the favorable one, the true false-positive rate can be much larger than $\alpha$.
For example, if each test is performed at level $\alpha$, then each individual test has a false-positive rate of at most $\alpha$ under $H_0$. But if we try many tests and keep the most favorable result, the chance that at least one of them looks significant can become much larger than $\alpha$.
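A quick simulation illustrates the inflation. In the hypothetical scenario below, an analyst tries 20 independent tests per experiment, every null is true, and only the smallest p-value is reported; the chance of a "significant" result is then close to $1 - 0.95^{20} \approx 0.64$, far above the nominal $0.05$.

```python
import math
import random

random.seed(1)

def two_sided_p(z: float) -> float:
    """Two-sided p-value for an N(0, 1) statistic: 2 * Phi(-|z|)."""
    return math.erfc(abs(z) / math.sqrt(2.0))

n_experiments = 5_000
n_tests = 20      # hypothetical number of outcomes tried per experiment
alpha = 0.05

false_positives = 0
for _ in range(n_experiments):
    # Every null is true: each test statistic is a fresh N(0, 1) draw.
    p_min = min(two_sided_p(random.gauss(0.0, 1.0)) for _ in range(n_tests))
    if p_min <= alpha:  # report only the most favorable test
        false_positives += 1

rate = false_positives / n_experiments
print(rate)
```

Each individual test honestly has size $0.05$; it is the selective reporting, not any single test, that destroys the error guarantee.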
Note
p-hacking can be understood as breaking the link between the reported p-value and the pre-specified size $\alpha$ of the test.