p-values
Reporting "reject" or "retain" alone is not very informative. Instead, we could ask, for every $\alpha \in (0, 1)$, whether the test rejects at that level. Generally, if the test rejects at level $\alpha$, it will also reject at any level $\alpha' > \alpha$. Hence, there is a smallest $\alpha$ at which the test rejects, and we call this number the p-value.
This ties the p-value directly to the idea of the size of a test. The size $\alpha$ is chosen in advance as a tolerated false-positive rate, while the p-value is computed after seeing the data. The rule is: reject $H_0$ if and only if the p-value is at most $\alpha$.
Definition (p-value)
Suppose that for every $\alpha \in (0, 1)$ we have a size $\alpha$ test with reject region $R_\alpha$. Then,

$$\text{p-value} = \inf\{\alpha : x^n \in R_\alpha\}.$$

That is, the p-value is the smallest level $\alpha$ at which we can reject $H_0$.
Informally, the p-value is a measure of the evidence against $H_0$: the smaller the p-value, the stronger the evidence against $H_0$.
Intuition
Imagine sliding the significance level from very small to larger values. The p-value is the first point where the observed data becomes extreme enough to enter the rejection region.
So a p-value of $p$ means: the data is just strong enough to reject $H_0$ at every level $\alpha \ge p$, but not at any level $\alpha < p$.
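The sliding picture can be made concrete with a small sketch. The setup below is an assumption for illustration (a right-tailed test whose statistic is $N(0,1)$ under the null, with a hypothetical observed value of 2.1); it scans $\alpha$ from small to large and reports the first level at which the test rejects, alongside the tail area beyond the observed statistic.

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF, via the error function (stdlib only)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def z_critical(alpha: float) -> float:
    """Critical value c with P(Z >= c) = alpha, found by bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if 1.0 - phi(mid) > alpha:
            lo = mid  # tail area still too big: move the cutoff right
        else:
            hi = mid
    return (lo + hi) / 2.0

t_obs = 2.1  # hypothetical observed statistic (illustrative value)

# Slide alpha from small to large and record the first level that rejects.
alphas = [a / 1000.0 for a in range(1, 200)]  # 0.001, 0.002, ..., 0.199
smallest_rejecting = next(a for a in alphas if t_obs >= z_critical(a))

p_value = 1.0 - phi(t_obs)  # tail area beyond the observed statistic
print(smallest_rejecting, round(p_value, 4))
```

Up to the resolution of the $\alpha$ grid, the first rejecting level coincides with the tail probability, which is exactly the definition of the p-value.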
Warning
A large p-value is not strong evidence in favor of $H_0$. A large p-value can occur for two reasons:
- $H_0$ is true, or
- $H_0$ is false but the test has low power.
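The low-power case is easy to simulate. In this sketch (all numbers are illustrative assumptions, not from the text), the null $H_0 : \mu = 0$ is false, yet with only five observations the right-tailed z-test usually produces an unremarkable p-value.

```python
import math
import random

random.seed(2)

def phi(z: float) -> float:
    """Standard normal CDF (stdlib only)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Hypothetical setup: H0 says mu = 0, but the true mean is 0.2.  With only
# n = 5 observations the right-tailed z-test has low power, so large
# p-values are common even though H0 is false.
n, mu_true, reps = 5, 0.2, 2_000
large_p = 0
for _ in range(reps):
    xbar = sum(random.gauss(mu_true, 1.0) for _ in range(n)) / n
    z = xbar * math.sqrt(n)        # z-statistic for H0: mu = 0, sigma = 1
    if 1.0 - phi(z) > 0.1:         # p-value bigger than 0.1
        large_p += 1
print(large_p / reps)  # most runs give a p-value above 0.1
```

So a large p-value here says nothing in favor of $H_0$; the test simply cannot tell the two hypotheses apart at this sample size.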
Warning
A p-value is not the probability that $H_0$ is true. It is a probability computed assuming $H_0$ is true.
More precisely, it measures how surprising the observed test statistic would be under the null model.
The following result explains how to compute the p-value.
Theorem
Suppose that the size $\alpha$ test is of the form

$$\text{reject } H_0 \iff T(X^n) \ge c_\alpha.$$

Then,

$$\text{p-value} = \sup_{\theta \in \Theta_0} \mathbb{P}_\theta\left(T(X^n) \ge T(x^n)\right),$$

where $X^n$ denotes the random sample and $x^n$ denotes the observed sample. Thus, $T(X^n)$ is the random test statistic and $T(x^n)$ is its observed value. If $\Theta_0 = \{\theta_0\}$, then

$$\text{p-value} = \mathbb{P}_{\theta_0}\left(T(X^n) \ge T(x^n)\right).$$
Intuition
The p-value is the probability (under $H_0$) of observing a value of the test statistic the same as, or more extreme than, what was actually observed.
In other words: if the null hypothesis were true, how surprising would this data look?
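This recipe can be sketched directly. Assume, for illustration only, that the statistic is $N(0,1)$ under $H_0$ and the test is right-tailed; the tail probability $\mathbb{P}_{\theta_0}(T(X^n) \ge T(x^n))$ is then estimated by drawing the statistic repeatedly under the null model (all variable names here are hypothetical).

```python
import math
import random

random.seed(0)

def phi(z: float) -> float:
    """Standard normal CDF (stdlib only)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Assumption for illustration: the statistic T(X^n) is N(0, 1) under H0
# and the test is right-tailed.  Suppose we observed T(x^n) = 1.8.
t_obs = 1.8

# Monte Carlo estimate of P_{theta_0}(T(X^n) >= T(x^n)): draw the statistic
# many times under the null model and count how often it reaches t_obs.
draws = [random.gauss(0.0, 1.0) for _ in range(200_000)]
p_mc = sum(t >= t_obs for t in draws) / len(draws)

p_exact = 1.0 - phi(t_obs)  # exact tail area under the N(0, 1) null
print(round(p_mc, 3), round(p_exact, 3))
```

The simulated tail frequency and the exact tail area agree to a few decimal places, which is all the theorem asserts: the p-value is a null tail probability evaluated at the observed statistic.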
Connection to the Reject Region
In hypothesis testing, a test rejects when the statistic falls in a reject region. The p-value repackages that same information into a single number.
- The reject region answers: at a fixed level $\alpha$, do we reject or not?
- The p-value answers: how far into the reject region did the observed statistic go?
So the p-value is not a different idea from hypothesis testing. It is a more informative summary of the same test.
Why Smaller Means Stronger Evidence
If the observed test statistic is very extreme under $H_0$, then the tail probability

$$\mathbb{P}_{\theta_0}\left(T(X^n) \ge T(x^n)\right)$$

will be small. That means the observed data would be unusual if the null were true, so we view the data as evidence against $H_0$.
If the p-value is not small, then the data is not especially surprising under $H_0$. But this only means the data is compatible with $H_0$; it does not prove that $H_0$ is correct.
Geometric Interpretation
There is a useful geometric way to think about both the size and the p-value.
Consider the sampling distribution of the test statistic under . The x-axis represents possible values of the test statistic, and the area under the curve represents probability.
- The size $\alpha$ is the area of the reject region under the null distribution.
- The p-value is the area under the null distribution corresponding to values at least as extreme as the observed test statistic.
For a right-tailed test, the reject region has the form

$$R_\alpha = \{x^n : T(x^n) \ge c_\alpha\},$$

where $c_\alpha$ is chosen so that

$$\mathbb{P}_{\theta_0}\left(T(X^n) \ge c_\alpha\right) = \alpha.$$

Geometrically, $\alpha$ is the area to the right of the critical value $c_\alpha$.
After observing the data, suppose the test statistic takes the value $t = T(x^n)$. Then the p-value is

$$\mathbb{P}_{\theta_0}\left(T(X^n) \ge t\right),$$

which is the area to the right of the observed value.
So, geometrically:
- $\alpha$ sets a cutoff in advance
- the p-value measures the tail area beyond the observed statistic
For a two-sided test, extremeness is measured in both tails. In that case, the p-value is the total area in both tails beyond the observed magnitude of the test statistic.
This geometric picture explains why

$$\text{p-value} \le \alpha \iff \text{reject } H_0.$$

If the tail area beyond the observed statistic is smaller than the pre-chosen tail area $\alpha$, then the observed statistic lies in the reject region.
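This equivalence can be checked numerically. The sketch below (assuming an $N(0,1)$ null for a right-tailed test, stdlib only) computes $c_\alpha$ by bisection and verifies that the p-value is at most $\alpha$ exactly when the statistic clears the critical value.

```python
import math

def phi(z: float) -> float:
    """Standard normal CDF (stdlib only)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def upper_critical(alpha: float) -> float:
    """c_alpha with P(Z >= c_alpha) = alpha, by bisection (N(0,1) null)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if 1.0 - phi(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# For a right-tailed test with an N(0, 1) null, check numerically that
#   p-value <= alpha   if and only if   t_obs >= c_alpha.
for alpha in (0.01, 0.05, 0.10):
    c = upper_critical(alpha)
    for t_obs in (-1.0, 0.5, 1.5, 2.0, 3.0):
        p = 1.0 - phi(t_obs)  # tail area beyond the observed statistic
        assert (p <= alpha) == (t_obs >= c)
print("equivalence holds on the grid")
```

The two comparisons are the same event stated in different coordinates: one compares areas, the other compares positions on the axis.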
Example
Suppose we are testing

$$H_0 : \theta = \theta_0 \quad \text{versus} \quad H_1 : \theta \ne \theta_0$$

using the Wald statistic

$$W = \frac{\hat{\theta} - \theta_0}{\widehat{\mathrm{se}}}.$$

Under $H_0$, we have approximately $W \approx N(0, 1)$. For a two-sided test, the p-value is therefore

$$\text{p-value} = \mathbb{P}\left(|Z| > |w|\right) = 2\Phi(-|w|),$$

where $Z \sim N(0, 1)$ and $w$ is the observed value of $W$.
For example, if the observed test statistic is $w = 2.1$, then

$$\text{p-value} = 2\Phi(-2.1) \approx 0.036.$$

So:
- we reject $H_0$ at level $\alpha = 0.05$
- we reject $H_0$ at level $\alpha = 0.04$
- we do not reject $H_0$ at level $\alpha = 0.01$
This shows both interpretations at once:
- as a threshold rule, the p-value is the smallest level $\alpha$ at which the test rejects
- as a tail probability, the p-value is the probability under $H_0$ of seeing a test statistic at least this extreme
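The Wald computation is short to code. A minimal sketch (the input numbers are illustrative assumptions, not data from the text), using the identity $2\Phi(-|w|) = \operatorname{erfc}(|w|/\sqrt{2})$ so that only the standard library is needed:

```python
import math

def wald_p_value(theta_hat: float, theta_0: float, se_hat: float):
    """Two-sided Wald p-value: w = (theta_hat - theta_0) / se_hat and
    p = P(|Z| >= |w|) = 2 * Phi(-|w|) = erfc(|w| / sqrt(2))."""
    w = (theta_hat - theta_0) / se_hat
    return w, math.erfc(abs(w) / math.sqrt(2.0))

# Illustrative inputs (hypothetical): these give an observed w of 2.1.
w, p = wald_p_value(0.521, 0.5, 0.01)
print(round(w, 2), round(p, 3))
```

With $w = 2.1$ the p-value comes out near $0.036$, so this test would reject at level $0.05$ but not at level $0.01$.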
p-hacking
In principled hypothesis testing, the testing procedure is chosen before looking at the data. This includes:
- the null and alternative hypotheses
- the test statistic
- the significance level $\alpha$
- the stopping rule and data-cleaning rules
After that, the data is collected, the p-value is computed, and a decision is made about whether to reject.
Intuition (p-hacking)
p-hacking is what happens when the analysis is adjusted after looking at the data in order to make the p-value small enough to cross the chosen threshold.
In other words, instead of asking "if $H_0$ were true, how surprising is this data under the pre-chosen test?", the procedure is repeatedly altered until the data looks sufficiently surprising.
Common forms of p-hacking include:
- trying many outcomes and reporting only the one with the smallest p-value
- stopping data collection as soon as the p-value drops below $\alpha$
- trying several model specifications and only reporting the significant one
- removing “outliers” only when doing so helps produce significance
- switching between one-sided and two-sided tests after seeing the data
Why p-hacking is a problem
The size $\alpha$ only controls the false-positive rate for a fixed testing procedure. Once we try many procedures and only report the favorable one, the true false-positive rate can be much larger than $\alpha$.
For example, if each test is performed at level $\alpha$, then each individual test has a false-positive rate of at most $\alpha$ under $H_0$. But if we try many tests and keep the most favorable result, the chance that at least one of them looks significant can become much larger than $\alpha$.
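A quick simulation illustrates the inflation. In the hypothetical scenario below, an analyst tries 20 independent tests per experiment, every null is true, and only the smallest p-value is reported; the chance of a "significant" result is then close to $1 - 0.95^{20} \approx 0.64$, far above the nominal $0.05$.

```python
import math
import random

random.seed(1)

def two_sided_p(z: float) -> float:
    """Two-sided p-value for an N(0, 1) statistic: 2 * Phi(-|z|)."""
    return math.erfc(abs(z) / math.sqrt(2.0))

n_experiments = 5_000
n_tests = 20      # hypothetical number of outcomes tried per experiment
alpha = 0.05

false_positives = 0
for _ in range(n_experiments):
    # Every null is true: each test statistic is a fresh N(0, 1) draw.
    p_min = min(two_sided_p(random.gauss(0.0, 1.0)) for _ in range(n_tests))
    if p_min <= alpha:  # report only the most favorable test
        false_positives += 1

rate = false_positives / n_experiments
print(rate)
```

Each individual test honestly has size $0.05$; it is the selective reporting, not any single test, that destroys the error guarantee.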
Note
p-hacking can be understood as breaking the link between the reported p-value and the pre-specified size $\alpha$ of the test.