p-values

Reporting only "reject $H_0$" or "retain $H_0$" is not very informative. Instead, we could ask, for every $\alpha$, whether the test rejects at that level. Generally, if the test rejects at level $\alpha$, it will also reject at level $\alpha' > \alpha$. Hence, there is a smallest $\alpha$ at which the test rejects, and we call this number the p-value.

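This ties the p-value directly to the idea of the size of a test. The size $\alpha$ is chosen in advance as a tolerated false-positive rate, while the p-value is computed after seeing the data. The rule is:

$$\text{reject } H_0 \text{ at level } \alpha \quad \text{when} \quad \text{p-value} < \alpha.$$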

Definition (p-value)

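Suppose that for every $\alpha \in (0, 1)$ we have a size $\alpha$ test with reject region $R_\alpha$. Then,

$$\text{p-value} = \inf\{\alpha : T(X^n) \in R_\alpha\}.$$

That is, the p-value is the smallest level at which we can reject $H_0$.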

Informally, the p-value is a measure of the evidence against $H_0$: the smaller the p-value, the stronger the evidence against $H_0$.

Intuition

Imagine sliding the significance level $\alpha$ from very small to larger values. The p-value is the first point where the observed data becomes extreme enough to enter the rejection region.

So a p-value of $p$ means: this data is just strong enough to reject at every level above $p$, but not at any level below $p$.
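
The sliding picture can be checked numerically. The sketch below is only an illustration under stated assumptions: a right-tailed test whose statistic is standard normal under the null, and a hypothetical observed value `t_obs = 1.8`. It scans a grid of levels from small to large, reports the first level at which the test rejects, and compares it with the tail probability.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # assumed null distribution of the test statistic


def rejects(t_obs: float, alpha: float) -> bool:
    """Right-tailed size-alpha test: reject when t_obs >= c_alpha."""
    c_alpha = STD_NORMAL.inv_cdf(1.0 - alpha)  # cutoff with tail area alpha
    return t_obs >= c_alpha


def smallest_rejecting_level(t_obs: float, grid_size: int = 10_000) -> float:
    """Slide alpha upward on a grid and return the first level that rejects."""
    for k in range(1, grid_size):
        alpha = k / grid_size
        if rejects(t_obs, alpha):
            return alpha
    return 1.0


t_obs = 1.8  # hypothetical observed statistic
p_by_scan = smallest_rejecting_level(t_obs)
p_by_tail = 1.0 - STD_NORMAL.cdf(t_obs)  # P(T >= t_obs) under the null
print(p_by_scan, p_by_tail)
```

Under these assumptions the scan stops at 0.036, within one grid step of the tail probability, which is about 0.0359.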

Warning

A large p-value is not strong evidence in favor of $H_0$. A large p-value can occur for two reasons:

$H_0$ is true, or

$H_0$ is false but the test has low power.
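
The second reason can be seen in a small simulation. This is a sketch under assumptions that are not in the notes: normal data with known standard deviation 1, a right-tailed z-test of a null mean of 0, a true mean of 0.2 (so the null is false), and only 10 observations per study.

```python
import random
from statistics import NormalDist, fmean


def z_test_p_value(sample, mu0=0.0, sigma=1.0):
    """Right-tailed z-test p-value for H0: mu = mu0 with known sigma."""
    z = (fmean(sample) - mu0) / (sigma / len(sample) ** 0.5)
    return 1.0 - NormalDist().cdf(z)


rng = random.Random(0)
true_mu, n, n_sims = 0.2, 10, 5_000  # H0 (mu = 0) is false, but n is small

p_values = [
    z_test_p_value([rng.gauss(true_mu, 1.0) for _ in range(n)])
    for _ in range(n_sims)
]

# Share of simulated studies that would NOT reject at level 0.05
share_not_significant = sum(p > 0.05 for p in p_values) / n_sims
print(share_not_significant)
```

In this setup the power at level 0.05 is only about 0.16, so most simulated p-values are large even though the null is false.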

Warning

A p-value is not the probability that $H_0$ is true. It is a probability computed assuming $H_0$ is true.

More precisely, it is about how surprising the observed test statistic would be under the null model.

The following result explains how to compute the p-value.

Theorem

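Suppose that the size $\alpha$ test is of the form

$$\text{reject } H_0 \iff T(X^n) \ge c_\alpha.$$

Then,

$$\text{p-value} = \sup_{\theta \in \Theta_0} P_\theta\big(T(X^n) \ge T(x^n)\big),$$

where $X^n$ denotes the random sample and $x^n$ denotes the observed sample. Thus, $T(X^n)$ is the random test statistic and $T(x^n)$ is its observed value. If $\Theta_0 = \{\theta_0\}$, then

$$\text{p-value} = P_{\theta_0}\big(T(X^n) \ge T(x^n)\big).$$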

Intuition

The p-value is the probability (under $H_0$) of observing a value of the test statistic the same as or more extreme than what was actually observed.

In other words: if the null hypothesis were true, how surprising would this data look?
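
As a concrete instance of this tail-probability reading, here is a minimal sketch for a simple null. The setting is hypothetical: the statistic is the number of heads in `n` coin flips, the null says the coin is fair, and the test rejects for large counts. The data values (`n = 20`, `heads = 15`) are made up for illustration.

```python
from math import comb


def binomial_upper_tail(t_obs: int, n: int, p0: float) -> float:
    """P(T >= t_obs) when T ~ Binomial(n, p0), i.e. the tail under the null."""
    return sum(
        comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(t_obs, n + 1)
    )


n, heads = 20, 15  # hypothetical data
p_value = binomial_upper_tail(heads, n, p0=0.5)
print(round(p_value, 4))  # 0.0207
```

The printed value is the probability, if the coin were fair, of seeing 15 or more heads in 20 flips.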

Connection to the Reject Region

In hypothesis testing, a test rejects when the statistic falls in a reject region. The p-value repackages that same information into a single number.

The reject region answers: at a fixed level $\alpha$, do we reject or not?
The p-value answers: how far into the reject region did the observed statistic go?

So the p-value is not a different idea from hypothesis testing. It is a more informative summary of the same test.

Why Smaller Means Stronger Evidence

If the observed test statistic is very extreme under $H_0$, then the tail probability

$$P_{H_0}\big(T(X^n) \ge T(x^n)\big)$$

will be small. That means the observed data would be unusual if the null were true, so we view the data as evidence against $H_0$.

If the p-value is not small, then the data is not especially surprising under $H_0$. But this only means the data is compatible with $H_0$; it does not prove that $H_0$ is correct.

Geometric Interpretation

There is a useful geometric way to think about both the size and the p-value.

Consider the sampling distribution of the test statistic under $H_0$. The x-axis represents possible values of the test statistic, and the area under the curve represents probability.

The size $\alpha$ is the area of the reject region under the null distribution.
The p-value is the area under the null distribution corresponding to values at least as extreme as the observed test statistic.

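For a right-tailed test, the reject region has the form

$$R_\alpha = \{t : t \ge c_\alpha\},$$

where $c_\alpha$ is chosen so that

$$P_{H_0}\big(T(X^n) \ge c_\alpha\big) = \alpha.$$

Geometrically, $\alpha$ is the area to the right of the critical value $c_\alpha$.

After observing the data, suppose the test statistic takes the value $t_{\text{obs}} = T(x^n)$. Then the p-value is

$$\text{p-value} = P_{H_0}\big(T(X^n) \ge t_{\text{obs}}\big),$$

which is the area to the right of the observed value.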

So, geometrically:

$\alpha$ sets a cutoff in advance
the p-value measures the tail area beyond the observed statistic

For a two-sided test, extremeness is measured in both tails. In that case, the p-value is the total area in both tails beyond the observed magnitude of the test statistic.

This geometric picture explains why

$$\text{p-value} < \alpha \implies \text{reject } H_0 \text{ at level } \alpha.$$

If the tail area beyond the observed statistic is smaller than the pre-chosen tail area $\alpha$, then the observed statistic lies in the reject region.
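
The two areas can be compared directly in code. The following sketch assumes, for illustration only, that the test statistic is standard normal under the null and that the observed value is a hypothetical `t_obs = 1.8`. For each level it computes the cutoff and the tail area, for both the right-tailed and the two-sided case, and prints whether "tail area below the level" and "statistic beyond the cutoff" agree.

```python
from statistics import NormalDist

Z = NormalDist()  # assumed null distribution of the test statistic


def right_tail(t_obs: float, alpha: float):
    c_alpha = Z.inv_cdf(1 - alpha)  # cutoff with area alpha to its right
    p_value = 1 - Z.cdf(t_obs)      # area to the right of t_obs
    return p_value, c_alpha


def two_sided(t_obs: float, alpha: float):
    c_alpha = Z.inv_cdf(1 - alpha / 2)     # area alpha/2 in each tail
    p_value = 2 * (1 - Z.cdf(abs(t_obs)))  # both tails beyond |t_obs|
    return p_value, c_alpha


t_obs = 1.8  # hypothetical observed statistic
for alpha in (0.01, 0.05, 0.10):
    p1, c1 = right_tail(t_obs, alpha)
    p2, c2 = two_sided(t_obs, alpha)
    # Each pair of booleans should match: small tail area <=> beyond the cutoff
    print(alpha, p1 < alpha, t_obs >= c1, p2 < alpha, abs(t_obs) >= c2)
```

With these numbers the right-tailed test rejects at levels 0.05 and 0.10, while the two-sided test rejects only at 0.10, because its p-value counts the area in both tails.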

Example

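Suppose we are testing

$$H_0: \theta = \theta_0 \quad \text{versus} \quad H_1: \theta \neq \theta_0$$

using the Wald statistic

$$W = \frac{\hat{\theta} - \theta_0}{\widehat{\mathrm{se}}}.$$

Under $H_0$, we have approximately $W \sim N(0, 1)$. For a two-sided test, the p-value is therefore

$$\text{p-value} = P_{\theta_0}(|W| > |w|) \approx P(|Z| > |w|) = 2\Phi(-|w|),$$

where $Z \sim N(0, 1)$, $\Phi$ is the standard normal CDF, and $w$ is the observed value of $W$.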

For example, if the observed test statistic is, say,

$$w = 2.1,$$

then

$$\text{p-value} \approx P(|Z| > 2.1) = 2\Phi(-2.1) \approx 0.036.$$

So:

we reject at level $\alpha = 0.10$
we reject at level $\alpha = 0.05$
we do not reject at level $\alpha = 0.01$

This shows both interpretations at once:

as a threshold rule, $0.036$ is the smallest level at which the test rejects
as a tail probability, $0.036$ is the probability under $H_0$ of seeing a test statistic at least this extreme
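
A two-sided Wald p-value can be computed with the Python standard library alone. The sketch below is illustrative: the function name and the estimate, null value, and standard error are hypothetical, and the normal tail is evaluated through the identity $2\Phi(-|w|) = \operatorname{erfc}(|w|/\sqrt{2})$.

```python
from math import erfc, sqrt


def wald_two_sided_p_value(theta_hat: float, theta_0: float, se_hat: float):
    """Return the Wald statistic w and the approximate two-sided p-value."""
    w = (theta_hat - theta_0) / se_hat
    # P(|Z| > |w|) = 2 * Phi(-|w|) = erfc(|w| / sqrt(2)) for standard normal Z
    p_value = erfc(abs(w) / sqrt(2))
    return w, p_value


# Hypothetical inputs chosen only to illustrate the calculation
w, p_value = wald_two_sided_p_value(theta_hat=0.605, theta_0=0.5, se_hat=0.05)
print(round(w, 2), round(p_value, 3))

for alpha in (0.10, 0.05, 0.01):
    print(alpha, "reject" if p_value < alpha else "do not reject")
```

The normal approximation is only as good as the large-sample behaviour of the estimator, so in small samples this p-value should be read as approximate.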

p-hacking

In principled hypothesis testing, the testing procedure is chosen before looking at the data. This includes:

the null and alternative hypotheses
the test statistic
the significance level
the stopping rule and data-cleaning rules

After that, the data is collected, the p-value is computed, and a decision is made about whether to reject.

Intuition (p-hacking)

p-hacking is what happens when the analysis is adjusted after looking at the data in order to make the p-value small enough to cross the chosen threshold.

In other words, instead of asking “if $H_0$ were true, how surprising is this data under the pre-chosen test?”, the procedure is repeatedly altered until the data looks sufficiently surprising.

Common forms of p-hacking include:

trying many outcomes and reporting only the one with the smallest p-value
stopping data collection as soon as the p-value drops below $\alpha$
trying several model specifications and only reporting the significant one
removing “outliers” only when doing so helps produce significance
switching between one-sided and two-sided tests after seeing the data
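
The effect of one of these practices, stopping as soon as the result looks significant, can be simulated. This is a sketch under assumptions chosen for illustration: standard normal data with the null (mean 0) actually true, a two-sided z-test with known variance, a first look after 10 observations, and a cap of 200 observations. The function names are hypothetical.

```python
import random
from math import erfc, sqrt


def two_sided_p(z: float) -> float:
    """Two-sided p-value for a standard normal statistic."""
    return erfc(abs(z) / sqrt(2))


def peeking_rejects(rng, max_n=200, first_look=10, alpha=0.05) -> bool:
    """Collect N(0, 1) data one point at a time (H0: mean 0 is true) and
    test after every new observation from first_look onward."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        z = total / sqrt(n)  # z-statistic for the running mean
        if n >= first_look and two_sided_p(z) < alpha:
            return True  # stop early and report "significant"
    return False


rng = random.Random(1)
n_sims = 2_000
false_positive_rate = sum(peeking_rejects(rng) for _ in range(n_sims)) / n_sims
print(false_positive_rate)
```

Every individual look uses level 0.05, yet the share of simulated studies that end in a rejection is well above 0.05, because the stopping rule keeps only the favorable look.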

Why p-hacking is a problem

The size $\alpha$ only controls the false-positive rate for a fixed testing procedure. Once we try many procedures and only report the favorable one, the true false-positive rate can be much larger than $\alpha$.

For example, if each test is performed at level $\alpha$, then each individual test has at most an $\alpha$ false-positive rate under $H_0$. But if we try many tests and keep the most favorable result, the chance that at least one of them looks significant can become much larger than $\alpha$.
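
To put numbers on this, the sketch below assumes $m$ independent tests, each rejecting with probability exactly $\alpha$ when its null is true, so that the chance of at least one rejection is $1 - (1 - \alpha)^m$. The simulation draws p-values as Uniform(0, 1), which is an added assumption (it holds for continuous test statistics under the null), and the function names are hypothetical.

```python
import random


def prob_any_significant(alpha: float, m: int) -> float:
    """P(at least one of m independent level-alpha tests rejects), all nulls true."""
    return 1.0 - (1.0 - alpha) ** m


def simulate_best_of_m(alpha: float, m: int, n_sims: int = 10_000, seed: int = 0) -> float:
    """Monte Carlo check: draw m Uniform(0, 1) p-values and keep the smallest."""
    rng = random.Random(seed)
    hits = sum(min(rng.random() for _ in range(m)) < alpha for _ in range(n_sims))
    return hits / n_sims


for m in (1, 5, 20):
    print(m, round(prob_any_significant(0.05, m), 3), simulate_best_of_m(0.05, m))
```

Under these assumptions the formula gives 0.05 for one test, about 0.226 for five, and about 0.642 for twenty, so reporting the best of twenty tests at level 0.05 produces a "significant" result more often than not even when every null is true.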

Note

p-hacking can be understood as breaking the link between the reported p-value and the pre-specified size of the test.

Sources

Wasserman, L. (2010). All of Statistics: A Concise Course in Statistical Inference. Chapter 10.2.

Jake Tuero
