p-values
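Reporting only "reject $H_0$" or "retain $H_0$" is not very informative. Instead, we can ask, for every level $\alpha$, whether the test rejects at that level. The smallest such $\alpha$ is the p-value.
This ties the p-value directly to the idea of the size of a test. The size $\alpha$ is chosen before seeing the data, whereas the p-value is computed from the data.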
Definition (p-value)
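Suppose that for every $\alpha \in (0, 1)$ we have a size $\alpha$ test with reject region $R_\alpha$. Then,
$$\text{p-value} = \inf\{\alpha : T(X^n) \in R_\alpha\},$$
where $T(X^n)$ is the test statistic. That is, the p-value is the smallest level at which we can reject $H_0$.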
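The "smallest level at which we can reject" reading can be checked numerically. The sketch below is only an illustration under stated assumptions: a right-tailed test whose statistic is standard Normal under $H_0$, a hypothetical observed value `t_obs = 1.8`, and a finite grid of levels standing in for the infimum. The names `rejects` and `p_value_by_scanning` are made up for this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumed null distribution of the test statistic


def rejects(t_obs: float, alpha: float) -> bool:
    """Size-alpha right-tailed test: reject when t_obs >= c_alpha."""
    # c_alpha is the cutoff whose upper-tail area under the null equals alpha.
    c_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)
    return t_obs >= c_alpha


def p_value_by_scanning(t_obs: float, grid_size: int = 10_000) -> float:
    """Smallest alpha on the grid {1/grid_size, 2/grid_size, ...} that rejects."""
    for i in range(1, grid_size):
        alpha = i / grid_size
        if rejects(t_obs, alpha):
            return alpha
    return 1.0  # no level on the grid rejects


t_obs = 1.8  # hypothetical observed statistic
scanned = p_value_by_scanning(t_obs)
tail_area = 1.0 - STD_NORMAL.cdf(t_obs)
print(f"scanned p-value: {scanned:.4f}, upper-tail area: {tail_area:.4f}")
```

The scanned value agrees with the upper-tail area up to the grid spacing, which is the link between the definition and the tail-probability formula.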
Informally, the p-value is a measure of the evidence against $H_0$.
Intuition
Imagine sliding the significance level $\alpha$ from very small to larger values. The p-value is the first point where the observed data becomes extreme enough to enter the rejection region. So a p-value of, say, $0.03$ means: this data is just strong enough to reject at levels $0.05$ and $0.10$, but not at level $0.01$.
Warning
A large p-value is not strong evidence in favor of $H_0$. A large p-value can occur for two reasons: $H_0$ is true, or $H_0$ is false but the test has low power.
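To see how low power can leave a false $H_0$ unrejected, here is a small sketch. It assumes a setting not taken from the notes: a right-tailed z-test of $H_0: \mu = 0$ with known unit variance, statistic $\sqrt{n}\,\bar{X}$, and a hypothetical true mean `delta`. The function name `power_right_tailed_z` is illustrative.

```python
from statistics import NormalDist


def power_right_tailed_z(delta: float, n: int, alpha: float = 0.05) -> float:
    """Probability of rejecting H0: mu = 0 when the true mean is delta.

    Assumes unit-variance Normal data, so sqrt(n) * sample_mean ~ N(delta * sqrt(n), 1).
    """
    z = NormalDist()
    c_alpha = z.inv_cdf(1.0 - alpha)  # cutoff of the size-alpha test
    return 1.0 - z.cdf(c_alpha - delta * n ** 0.5)


# Same (false) null, same effect size, two hypothetical sample sizes.
power_small_n = power_right_tailed_z(delta=0.2, n=10)
power_large_n = power_right_tailed_z(delta=0.2, n=200)
print(f"power with n=10:  {power_small_n:.3f}")
print(f"power with n=200: {power_large_n:.3f}")
```

With the small sample the test rarely rejects even though $H_0$ is false, so a large p-value there says little about whether $H_0$ holds.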
Warning
A p-value is not the probability that $H_0$ is true. It is a probability computed assuming $H_0$ is true. More precisely, it is about how surprising the observed test statistic would be under the null model.
The following result explains how to compute the p-value.
Theorem
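Suppose that the size $\alpha$ test is of the form
$$\text{reject } H_0 \text{ if and only if } T(X^n) \ge c_\alpha.$$
Then,
$$\text{p-value} = \sup_{\theta \in \Theta_0} \mathbb{P}_\theta\big(T(X^n) \ge T(x^n)\big)$$
where $\Theta_0$ is the null parameter set, $X^n$ denotes the random sample and $x^n$ denotes the observed sample. Thus, $T(X^n)$ is the random test statistic and $T(x^n)$ is its observed value. If $\Theta_0 = \{\theta_0\}$, then
$$\text{p-value} = \mathbb{P}_{\theta_0}\big(T(X^n) \ge T(x^n)\big).$$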
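For a simple null the formula is a single tail probability, which can be computed exactly in small discrete cases. The sketch below assumes a hypothetical example that is not in the notes: $n = 20$ Bernoulli trials, $H_0: p = 0.5$ against $p > 0.5$, with the number of successes as the test statistic and 15 successes observed.

```python
from math import comb


def binomial_upper_tail(n: int, p0: float, x_obs: int) -> float:
    """P(X >= x_obs) for X ~ Binomial(n, p0): the right-tailed p-value under p = p0."""
    return sum(
        comb(n, k) * p0 ** k * (1.0 - p0) ** (n - k)
        for k in range(x_obs, n + 1)
    )


# Hypothetical data: 15 successes in 20 trials, null value p0 = 0.5.
p_val = binomial_upper_tail(n=20, p0=0.5, x_obs=15)
print(f"p-value = {p_val:.4f}")
```

Here `binomial_upper_tail` plays the role of $\mathbb{P}_{\theta_0}(T(X^n) \ge T(x^n))$ with $\theta_0 = 0.5$.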
Intuition
The p-value is the probability (under $H_0$) of observing a value of the test statistic the same as or more extreme than what was actually observed. In other words: if the null hypothesis were true, how surprising would this data look?
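The question "how surprising would this data look under the null?" can also be answered by simulation. The following is a rough sketch, assuming a null model of $n$ independent standard Normal observations, the statistic $\sqrt{n}\,\bar{X}$, and a hypothetical observed value of 1.5. The helper `simulated_p_value` is invented for this illustration.

```python
import random
from statistics import NormalDist, fmean


def simulated_p_value(t_obs: float, n: int, reps: int = 20_000, seed: int = 0) -> float:
    """Fraction of null datasets whose statistic is at least as large as t_obs."""
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(reps):
        # Generate one dataset as if H0 were true (mean 0, variance 1).
        sample = [rng.gauss(0.0, 1.0) for _ in range(n)]
        t_sim = n ** 0.5 * fmean(sample)
        if t_sim >= t_obs:
            at_least_as_extreme += 1
    return at_least_as_extreme / reps


t_obs = 1.5  # hypothetical observed statistic
estimate = simulated_p_value(t_obs, n=10)
exact = 1.0 - NormalDist().cdf(t_obs)  # sqrt(n) * mean is N(0, 1) under this null
print(f"simulated: {estimate:.4f}, exact tail area: {exact:.4f}")
```

The simulated fraction approximates the exact tail area, which makes the "probability under the null" interpretation concrete.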
Connection to the Reject Region
In hypothesis testing, a test rejects when the statistic falls in a reject region. The p-value repackages that same information into a single number.
- The reject region answers: at a fixed level $\alpha$, do we reject or not?
- The p-value answers: how far into the reject region did the observed statistic go?
So the p-value is not a different idea from hypothesis testing. It is a more informative summary of the same test.
Why Smaller Means Stronger Evidence
If the observed test statistic is very extreme under $H_0$, then the p-value will be small. That means the observed data would be unusual if the null were true, so we view the data as evidence against $H_0$.
If the p-value is not small, then the data is not especially surprising under $H_0$.
Geometric Interpretation
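There is a useful geometric way to think about both the size $\alpha$ and the p-value.
Consider the sampling distribution of the test statistic under $H_0$.
- The size $\alpha$ is the area of the reject region under the null distribution.
- The p-value is the area under the null distribution corresponding to values at least as extreme as the observed test statistic.
For a right-tailed test, the reject region has the form
$$R_\alpha = \{t : t \ge c_\alpha\}$$
where $c_\alpha$ is the critical value. Geometrically,
$$\alpha = \mathbb{P}_{\theta_0}\big(T(X^n) \ge c_\alpha\big)$$
is the area under the null density to the right of $c_\alpha$, where $\mathbb{P}_{\theta_0}$ denotes probability under $H_0$.
After observing the data, suppose the test statistic takes the value $T(x^n)$. Then
$$\text{p-value} = \mathbb{P}_{\theta_0}\big(T(X^n) \ge T(x^n)\big)$$
which is the area to the right of the observed value.
So, geometrically:
- $\alpha$ sets a cutoff in advance
- the p-value measures the tail area beyond the observed statistic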
For a two-sided test, extremeness is measured in both tails. In that case, the p-value is the total area in both tails beyond the observed magnitude of the test statistic.
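The two tail-area pictures can be compared directly in code. This is a minimal sketch assuming the test statistic is standard Normal under $H_0$. The function names and the value 1.5 are illustrative only.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumed null distribution of the test statistic


def right_tail_area(t: float) -> float:
    """Area under the null density to the right of t."""
    return 1.0 - STD_NORMAL.cdf(t)


def two_sided_tail_area(t: float) -> float:
    """Total area in both tails beyond the magnitude |t|."""
    return 2.0 * (1.0 - STD_NORMAL.cdf(abs(t)))


t_obs = 1.5  # hypothetical observed statistic
print(f"right-tailed p-value: {right_tail_area(t_obs):.4f}")
print(f"two-sided p-value:    {two_sided_tail_area(t_obs):.4f}")

# The size is the same kind of area, measured from the cutoff instead.
c_alpha = STD_NORMAL.inv_cdf(0.95)
print(f"area beyond the 5% cutoff: {right_tail_area(c_alpha):.4f}")
```

For a symmetric null distribution the two-sided area is twice the one-sided area beyond $|t|$, and it does not depend on the sign of the observed statistic.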
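This geometric picture explains why the test rejects at level $\alpha$ exactly when the p-value is at most $\alpha$.
If the tail area beyond the observed statistic is smaller than the pre-chosen tail area $\alpha$, then the observed statistic must lie beyond the cutoff, that is, inside the reject region.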
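For a right-tailed test this comparison of areas can be written out in symbols. The sketch below assumes a continuous null distribution whose tail area is strictly decreasing in the cutoff, and it introduces $S(t)$ for the null tail area, a symbol chosen here for convenience rather than taken from the source.

$$
\begin{aligned}
S(t) &= \mathbb{P}_{\theta_0}\big(T(X^n) \ge t\big), \qquad \alpha = S(c_\alpha), \qquad \text{p-value} = S\big(T(x^n)\big), \\
T(x^n) \ge c_\alpha \;&\Longleftrightarrow\; S\big(T(x^n)\big) \le S(c_\alpha) \;\Longleftrightarrow\; \text{p-value} \le \alpha .
\end{aligned}
$$

The middle step uses only that $S$ is decreasing: moving the statistic further to the right can only shrink the tail area beyond it.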
Example
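Suppose we are testing
$$H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta \neq \theta_0$$
using the Wald statistic
$$W = \frac{\hat{\theta} - \theta_0}{\widehat{\mathrm{se}}}.$$
Under $H_0$, $W$ is approximately $N(0, 1)$, so for an observed value $w$ of $W$,
$$\text{p-value} \approx \mathbb{P}(|Z| \ge |w|) = 2\Phi(-|w|)$$
where $Z \sim N(0, 1)$ and $\Phi$ is the standard Normal CDF.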
For example, if the observed test statistic is $w = 2.17$, then
$$\text{p-value} \approx \mathbb{P}(|Z| \ge 2.17) = 2\Phi(-2.17) \approx 0.03.$$
So:
- we reject at level $0.10$
- we reject at level $0.05$
- we do not reject at level $0.01$
This shows both interpretations at once:
- as a threshold rule, $0.03$ is the smallest level at which the test rejects
- as a tail probability, $0.03$ is the probability under $H_0$ of seeing a test statistic at least this extreme
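A Wald-test p-value of this kind takes only a few lines to compute. The sketch below assumes the Normal approximation for the Wald statistic. The estimate, null value and standard error passed in are hypothetical numbers chosen for illustration, and `wald_p_value` is not a library function.

```python
from statistics import NormalDist


def wald_p_value(theta_hat: float, theta_0: float, se_hat: float) -> tuple[float, float]:
    """Return the Wald statistic and its two-sided p-value under N(0, 1)."""
    w = (theta_hat - theta_0) / se_hat
    p_value = 2.0 * NormalDist().cdf(-abs(w))  # area in both tails beyond |w|
    return w, p_value


# Hypothetical inputs: estimate 0.56, null value 0.50, standard error 0.03.
w, p = wald_p_value(theta_hat=0.56, theta_0=0.50, se_hat=0.03)
rejected_levels = [alpha for alpha in (0.01, 0.05, 0.10) if p <= alpha]
print(f"w = {w:.2f}, p-value = {p:.4f}")
print(f"levels at which the test rejects: {rejected_levels}")
```

Comparing one p-value against several levels at once is exactly the threshold-rule reading: the test rejects at every listed level that is at least as large as the p-value.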
p-hacking
In principled hypothesis testing, the testing procedure is chosen before looking at the data. This includes:
- the null and alternative hypotheses
- the test statistic
- the significance level $\alpha$
- the stopping rule and data-cleaning rules
After that, the data is collected, the p-value is computed, and a decision is made about whether to reject.
Intuition (p-hacking)
p-hacking is what happens when the analysis is adjusted after looking at the data in order to make the p-value small enough to cross the chosen threshold.
In other words, instead of asking “if $H_0$ were true, how surprising is this data under the pre-chosen test?”, the procedure is repeatedly altered until the data looks sufficiently surprising.
Common forms of p-hacking include:
- trying many outcomes and reporting only the one with the smallest p-value
- stopping data collection as soon as the p-value drops below $\alpha$
- trying several model specifications and only reporting the significant one
- removing “outliers” only when doing so helps produce significance
- switching between one-sided and two-sided tests after seeing the data
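The effect of the first item in this list, trying many outcomes and reporting only the smallest p-value, can be simulated. The sketch below makes simplifying assumptions that are not in the notes: every null hypothesis is true, the outcomes are independent, and each test statistic is exactly standard Normal under $H_0$. The function `false_positive_rate` and all parameter values are illustrative.

```python
import random
from statistics import NormalDist

STD_NORMAL = NormalDist()


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a statistic that is N(0, 1) under H0."""
    return 2.0 * STD_NORMAL.cdf(-abs(z))


def false_positive_rate(num_outcomes: int, alpha: float = 0.05,
                        reps: int = 5_000, seed: int = 0) -> float:
    """Share of simulated studies whose smallest p-value falls below alpha.

    Every null is true here, so each 'significant' result is a false positive.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        p_values = [two_sided_p(rng.gauss(0.0, 1.0)) for _ in range(num_outcomes)]
        if min(p_values) < alpha:  # report only the best-looking outcome
            hits += 1
    return hits / reps


rate_one = false_positive_rate(num_outcomes=1)
rate_ten = false_positive_rate(num_outcomes=10)
print(f"one pre-specified outcome: {rate_one:.3f}")
print(f"best of ten outcomes:      {rate_ten:.3f}")
print(f"theory for ten independent tests: {1 - 0.95 ** 10:.3f}")
```

With a single pre-specified outcome the false-positive rate stays near $\alpha$. Picking the best of ten pushes it far above $\alpha$, close to the value $1 - (1 - \alpha)^{10}$ expected for independent tests.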
Why p-hacking is a problem
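The size $\alpha$ of a test bounds the probability of rejecting a true $H_0$ only when the procedure is fixed before seeing the data. If several analyses are tried and only the most favorable one is reported, the probability of at least one false rejection can be much larger than $\alpha$.
For example, if each test is performed at level $\alpha = 0.05$ and $20$ independent tests are run when every null hypothesis is true, the probability that at least one of them rejects is $1 - 0.95^{20} \approx 0.64$.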
Note
p-hacking can be understood as breaking the link between the reported p-value and the pre-specified size of the test.
Sources
- Wasserman, L. (2010). All of Statistics: A Concise Course in Statistical Inference. Chapter 10.2.