Probability Inequalities

Markov’s Inequality

https://en.wikipedia.org/wiki/Markov%27s_inequality

Markov’s inequality provides a bound on the probability that the realization of a non-negative random variable exceeds a given threshold. It relates probabilities to expectations.

Theorem (Markov's Inequality)

Let $X$ be a non-negative random variable, and suppose that $\mathbb{E}[X]$ exists. For any $t > 0$,

$$\mathbb{P}(X \geq t) \leq \frac{\mathbb{E}[X]}{t}.$$

Proof

Since $X \geq 0$ (writing the argument for a density $f$; the discrete case is analogous),

$$\mathbb{E}[X] = \int_0^\infty x f(x)\,dx \geq \int_t^\infty x f(x)\,dx \geq t \int_t^\infty f(x)\,dx = t\,\mathbb{P}(X \geq t).$$

Suppose $\mathbb{E}[X] = 10$. By Markov’s Inequality, $\mathbb{P}(X \geq 20) \leq 10/20 = 1/2$. The intuition here is that you cannot have more than half of the distribution to the right of 20 and still have enough distribution left to result in an expected value of 10. The further a value is from the expected value, the less density it can carry while keeping the expected value the same. Note that this bound may not always be tight. Markov’s Inequality is also used to prove other inequalities, such as Chebyshev’s Inequality below.
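A quick simulation makes this concrete. The exponential distribution with mean 10 below is an arbitrary choice of non-negative distribution, not something fixed by the example:

```python
import random

random.seed(0)

# Sample a non-negative random variable with E[X] = 10
# (exponential here, but any non-negative distribution works).
n = 100_000
samples = [random.expovariate(1 / 10) for _ in range(n)]

t = 20.0
tail_prob = sum(x >= t for x in samples) / n      # estimate of P(X >= t)
markov_bound = (sum(samples) / n) / t             # E[X] / t, estimated

print(f"P(X >= {t}) ~ {tail_prob:.3f} <= Markov bound {markov_bound:.3f}")
```

For this distribution the true tail probability is $e^{-2} \approx 0.135$, well below the Markov bound of $1/2$: the bound is valid but far from tight.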

Chebyshev’s Inequality

https://en.wikipedia.org/wiki/Chebyshev%27s_inequality

Theorem (Chebyshev's Inequality)

Let $\mu = \mathbb{E}[X]$ and $\sigma^2 = \mathbb{V}(X)$. Then,

$$\mathbb{P}(|X - \mu| \geq t) \leq \frac{\sigma^2}{t^2} \quad \text{and} \quad \mathbb{P}(|Z| \geq k) \leq \frac{1}{k^2},$$

where $Z = (X - \mu)/\sigma$. In particular, $\mathbb{P}(|Z| \geq 2) \leq 1/4$ and $\mathbb{P}(|Z| \geq 3) \leq 1/9$.

One way to think about Chebyshev’s inequality is that the variance is a budget of squared distance from the mean, and a tail event spends that budget very quickly. On the event $\{|X - \mu| \geq t\}$, the squared deviation $(X - \mu)^2$ is at least $t^2$. So if that event happens with probability $p$, it contributes at least $p t^2$ to the average squared deviation:

$$\sigma^2 = \mathbb{E}\left[(X - \mu)^2\right] \geq t^2\, \mathbb{P}(|X - \mu| \geq t).$$

Rearranging gives the result.
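The budget argument can be checked numerically; the standard normal below is just an illustrative choice of distribution:

```python
import random

random.seed(0)

n = 100_000
xs = [random.gauss(0, 1) for _ in range(n)]

mu = sum(xs) / n
var = sum((x - mu) ** 2 for x in xs) / n

t = 2.0
tail_prob = sum(abs(x - mu) >= t for x in xs) / n
# Variance contributed by the tail event alone:
tail_budget = sum((x - mu) ** 2 for x in xs if abs(x - mu) >= t) / n

# The tail spends at least t^2 * P(|X - mu| >= t) of the variance budget,
# and the total budget is var; rearranging gives Chebyshev's bound.
print(f"{t**2 * tail_prob:.3f} <= {tail_budget:.3f} <= {var:.3f}")
```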

Example

Suppose we test a prediction method like a neural network on a set of $n$ new test cases. Let $X_i = 1$ if the prediction is wrong, and $X_i = 0$ if the predictor is correct. Then $\bar{X}_n = n^{-1} \sum_{i=1}^{n} X_i$ is the observed error rate. Each $X_i$ may be regarded as a Bernoulli with unknown mean $p$. We would like to know the true, but unknown, error rate $p$. Intuitively, we expect that $\bar{X}_n$ should be close to $p$. How likely is $\bar{X}_n$ to not be within $\epsilon$ of $p$? We have $\mathbb{V}(\bar{X}_n) = \mathbb{V}(X_1)/n = p(1 - p)/n$, and

$$\mathbb{P}(|\bar{X}_n - p| \geq \epsilon) \leq \frac{\mathbb{V}(\bar{X}_n)}{\epsilon^2} = \frac{p(1 - p)}{n \epsilon^2} \leq \frac{1}{4 n \epsilon^2},$$

since $p(1 - p) \leq \frac{1}{4}$ for all $p \in [0, 1]$. For $\epsilon = 0.2$ and $n = 100$, the bound is $0.0625$.
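Plugging in the numbers from the example (the function name is mine):

```python
def chebyshev_error_bound(n: int, eps: float) -> float:
    """Chebyshev bound on P(|observed error rate - p| >= eps),
    using p(1 - p) <= 1/4 for every Bernoulli parameter p."""
    return 1 / (4 * n * eps ** 2)

print(round(chebyshev_error_bound(n=100, eps=0.2), 4))  # 0.0625
```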

Hoeffding’s Lemma

https://en.wikipedia.org/wiki/Hoeffding%27s_lemma

Hoeffding's Lemma

Let $X$ be any real-valued random variable such that $a \leq X \leq b$ almost surely. Then, for all $\lambda \in \mathbb{R}$,

$$\mathbb{E}\left[e^{\lambda (X - \mathbb{E}[X])}\right] \leq \exp\left(\frac{\lambda^2 (b - a)^2}{8}\right),$$

or equivalently,

$$\mathbb{E}\left[e^{\lambda X}\right] \leq \exp\left(\lambda\, \mathbb{E}[X] + \frac{\lambda^2 (b - a)^2}{8}\right).$$

If $X$ is bounded in $[a, b]$, then the centered variable $X - \mathbb{E}[X]$ has a moment generating function bounded like that of a Gaussian with variance $(b - a)^2 / 4$.

When coming up with bounds, a common trick is to use the Chernoff / exponential Markov template:

  1. Pick what you want to control, usually a sum or mean, e.g. $S_n = \sum_{i=1}^{n} X_i$.
  2. Exponentiate to make it non-negative so that Markov’s Inequality applies. For any $\lambda > 0$,
     $$\mathbb{P}(S_n \geq t) = \mathbb{P}\left(e^{\lambda S_n} \geq e^{\lambda t}\right) \leq e^{-\lambda t}\, \mathbb{E}\left[e^{\lambda S_n}\right].$$
  3. Bound the moment generating function $\mathbb{E}\left[e^{\lambda S_n}\right]$: if the $X_i$ are independent, the expectation of the exponentiated sum factors into a product, $\mathbb{E}\left[e^{\lambda S_n}\right] = \prod_{i=1}^{n} \mathbb{E}\left[e^{\lambda X_i}\right]$. Then we bound each factor using something like Hoeffding’s Lemma.
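The template can be sketched end-to-end for a toy case, a sum of independent Bernoulli(1/2) variables, where the MGF of each factor is available exactly (so no Hoeffding step is needed). The function name and parameters are mine:

```python
import math

def chernoff_bound(n: int, t: float, lam: float) -> float:
    """e^{-lam*t} * E[e^{lam*S_n}] for S_n a sum of n Bernoulli(1/2)."""
    mgf_one = 0.5 * (1 + math.exp(lam))       # E[e^{lam*X}] for one Bernoulli(1/2)
    return math.exp(-lam * t) * mgf_one ** n  # step 2 (Markov) + step 3 (product)

# Step 1: control S_100, asking for P(S_100 >= 70).
# The bound holds for every lam > 0, so take the best over a grid.
best = min(chernoff_bound(100, 70, k / 100) for k in range(1, 300))
print(f"Chernoff bound on P(S_100 >= 70): {best:.2e}")
```

Optimizing over $\lambda$ is the final part of the recipe; the same grid-minimization step reappears analytically in the proof of Hoeffding’s Inequality below.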

Hoeffding’s Inequality

https://en.wikipedia.org/wiki/Hoeffding%27s_inequality

Theorem (Hoeffding's Inequality)

Let $X_1, \ldots, X_n$ be independent observations such that $a_i \leq X_i \leq b_i$ almost surely, and let $\bar{X}_n = n^{-1} \sum_{i=1}^{n} X_i$ with $\mu = \mathbb{E}[\bar{X}_n]$. Then, for any $t > 0$,

$$\mathbb{P}(\bar{X}_n - \mu \geq t) \leq \exp\left(-\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right).$$

Hoeffding’s inequality can be used when we have independent, bounded observations and we want high-probability finite-sample bounds. If the data are unbounded or heavy-tailed, Hoeffding’s inequality is either inapplicable or overly conservative.

Proof

For any $\lambda > 0$, using that $x \mapsto e^{\lambda x}$ is increasing:

$$\mathbb{P}(\bar{X}_n - \mu \geq t) = \mathbb{P}\left(e^{\lambda (\bar{X}_n - \mu)} \geq e^{\lambda t}\right).$$

Applying Markov’s Inequality to the non-negative random variable $e^{\lambda (\bar{X}_n - \mu)}$,

$$\mathbb{P}(\bar{X}_n - \mu \geq t) \leq e^{-\lambda t}\, \mathbb{E}\left[e^{\lambda (\bar{X}_n - \mu)}\right],$$

where $\bar{X}_n - \mu = n^{-1} \sum_{i=1}^{n} Z_i$, with $Z_i = X_i - \mathbb{E}[X_i]$ and $\mathbb{E}[Z_i] = 0$. If the $X_i$ are independent, so are the $Z_i$, and

$$\mathbb{E}\left[e^{(\lambda / n) \sum_{i=1}^{n} Z_i}\right] = \prod_{i=1}^{n} \mathbb{E}\left[e^{(\lambda / n) Z_i}\right].$$

This gives us

$$\mathbb{P}(\bar{X}_n - \mu \geq t) \leq e^{-\lambda t} \prod_{i=1}^{n} \mathbb{E}\left[e^{(\lambda / n) Z_i}\right].$$

We can then apply Hoeffding’s Lemma to $Z_i$. Its range is $[a_i - \mathbb{E}[X_i],\; b_i - \mathbb{E}[X_i]]$, which has the same width $b_i - a_i$, so

$$\mathbb{E}\left[e^{(\lambda / n) Z_i}\right] \leq \exp\left(\frac{\lambda^2 (b_i - a_i)^2}{8 n^2}\right).$$

Plugging this back in gives

$$\mathbb{P}(\bar{X}_n - \mu \geq t) \leq e^{-\lambda t} \prod_{i=1}^{n} \exp\left(\frac{\lambda^2 (b_i - a_i)^2}{8 n^2}\right).$$

Combining the product into an exponential sum:

$$\mathbb{P}(\bar{X}_n - \mu \geq t) \leq \exp\left(-\lambda t + \frac{\lambda^2}{8 n^2} \sum_{i=1}^{n} (b_i - a_i)^2\right).$$

We now choose $\lambda$ to make the bound as small as possible. Let $c = \sum_{i=1}^{n} (b_i - a_i)^2$. We want to minimize the exponent $g(\lambda) = -\lambda t + \frac{\lambda^2 c}{8 n^2}$. Differentiating and setting to zero:

$$g'(\lambda) = -t + \frac{\lambda c}{4 n^2} = 0 \quad \Longrightarrow \quad \lambda^* = \frac{4 n^2 t}{c}.$$

Plugging back in,

$$g(\lambda^*) = -\frac{4 n^2 t^2}{c} + \frac{(4 n^2 t / c)^2\, c}{8 n^2} = -\frac{4 n^2 t^2}{c} + \frac{2 n^2 t^2}{c} = -\frac{2 n^2 t^2}{c}.$$

Therefore,

$$\mathbb{P}(\bar{X}_n - \mu \geq t) \leq \exp\left(-\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right).$$

The two-sided Hoeffding bound can be recovered by applying the same inequality to $Y_i = -X_i$, where $-b_i \leq Y_i \leq -a_i$, so we also get

$$\mathbb{P}(\bar{X}_n - \mu \leq -t) \leq \exp\left(-\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right),$$

and using the union bound:

$$\mathbb{P}(|\bar{X}_n - \mu| \geq t) \leq 2 \exp\left(-\frac{2 n^2 t^2}{\sum_{i=1}^{n} (b_i - a_i)^2}\right).$$

Theorem

Let $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$. Then for any $\epsilon > 0$,

$$\mathbb{P}(|\bar{X}_n - p| > \epsilon) \leq 2 e^{-2 n \epsilon^2},$$

where $\bar{X}_n = n^{-1} \sum_{i=1}^{n} X_i$.
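A small simulation (the parameters are arbitrary) confirms that the empirical two-sided tail sits below $2 e^{-2 n \epsilon^2}$:

```python
import math
import random

random.seed(0)

n, p, eps, trials = 50, 0.5, 0.1, 20_000
hoeffding = 2 * math.exp(-2 * n * eps ** 2)

exceed = 0
for _ in range(trials):
    xbar = sum(random.random() < p for _ in range(n)) / n  # Bernoulli(p) mean
    exceed += abs(xbar - p) > eps
empirical = exceed / trials

print(f"empirical {empirical:.3f} <= Hoeffding bound {hoeffding:.3f}")
```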

From Hoeffding to Confidence Intervals

Hoeffding’s inequality also gives a simple way of creating a confidence interval for a binomial parameter $p$. Fix $\alpha > 0$ and let

$$\epsilon_n = \sqrt{\frac{1}{2n} \log\left(\frac{2}{\alpha}\right)}.$$

By Hoeffding’s inequality,

$$\mathbb{P}(|\bar{X}_n - p| > \epsilon_n) \leq 2 e^{-2 n \epsilon_n^2} = \alpha.$$

Let $C = (\bar{X}_n - \epsilon_n,\, \bar{X}_n + \epsilon_n)$. Then $\mathbb{P}(p \notin C) = \mathbb{P}(|\bar{X}_n - p| > \epsilon_n) \leq \alpha$. Hence, $\mathbb{P}(p \in C) \geq 1 - \alpha$, and we call $C$ a $1 - \alpha$ confidence interval.
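As a sketch, the interval can be computed directly (the function name and example inputs are mine):

```python
import math

def hoeffding_interval(xbar: float, n: int, alpha: float = 0.05):
    """1 - alpha confidence interval for a Bernoulli mean via Hoeffding."""
    eps = math.sqrt(math.log(2 / alpha) / (2 * n))
    return xbar - eps, xbar + eps

# E.g. an observed error rate of 0.30 on 1000 test cases:
lo, hi = hoeffding_interval(xbar=0.30, n=1000)
print(f"95% CI: ({lo:.3f}, {hi:.3f})")
```

Note that the interval width depends only on $n$ and $\alpha$, not on $\bar{X}_n$, which is what makes the guarantee distribution-free but somewhat conservative.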

Mill’s Inequality

The following inequality is useful for bounding probability statements about Normal random variables.

Theorem (Mill's Inequality)

Let $Z \sim N(0, 1)$. Then,

$$\mathbb{P}(|Z| > t) \leq \sqrt{\frac{2}{\pi}}\, \frac{e^{-t^2 / 2}}{t}.$$
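Taking Mill’s bound in the form $\sqrt{2/\pi}\, e^{-t^2/2} / t$ for $\mathbb{P}(|Z| > t)$, we can compare it against the exact Gaussian tail computed with the complementary error function:

```python
import math

def mills_bound(t: float) -> float:
    """Mill's inequality bound on P(|Z| > t) for Z ~ N(0, 1)."""
    return math.sqrt(2 / math.pi) * math.exp(-t ** 2 / 2) / t

def exact_two_sided_tail(t: float) -> float:
    """Exact P(|Z| > t) = erfc(t / sqrt(2))."""
    return math.erfc(t / math.sqrt(2))

for t in (1.0, 2.0, 3.0):
    print(f"t={t}: exact {exact_two_sided_tail(t):.5f}"
          f" <= bound {mills_bound(t):.5f}")
```

Unlike Chebyshev, the bound decays like $e^{-t^2/2}$, which is the right order for Gaussian tails; it becomes quite sharp as $t$ grows.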