Linear Regression

Linear regression is one of the standard models for supervised learning when the response is continuous. The goal is to explain how the conditional mean of a response variable changes as a function of several predictors, while keeping the model simple enough to estimate, interpret, and test.

Suppose we observe pairs $(x_i, y_i)$ for $i = 1, \dots, n$, where $x_i \in \mathbb{R}^p$ is a vector of predictors and $y_i \in \mathbb{R}$ is the response. In multiple linear regression, we write

$$y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + \varepsilon_i,$$

where $\beta_0$ is the intercept, $\beta = (\beta_1, \dots, \beta_p)^\top$ is the coefficient vector, and $\varepsilon_i$ is the error term.

In matrix notation, this is

$$y = X\beta + \varepsilon,$$

where $y = (y_1, \dots, y_n)^\top$, $X \in \mathbb{R}^{n \times (p+1)}$ includes a column of ones for the intercept, $\beta = (\beta_0, \beta_1, \dots, \beta_p)^\top$, and $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)^\top$.

Definition

The conditional mean under the linear regression model is

$$\mathbb{E}[y \mid x] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p.$$

This means the model assumes the mean response is a linear function of the predictors.

Interpretation

The coefficient $\beta_j$ is the change in the mean response associated with a one-unit increase in predictor $x_j$, holding the other predictors fixed.

Properties

Some of the most useful properties of the linear model are easiest to state for the ordinary least squares estimator.

Definition (Ordinary Least Squares)

The ordinary least squares (OLS) estimator chooses $\hat\beta$ to minimize the residual sum of squares

$$\mathrm{RSS}(\beta) = \sum_{i=1}^n \bigl(y_i - \beta_0 - \beta_1 x_{i1} - \cdots - \beta_p x_{ip}\bigr)^2 = \lVert y - X\beta \rVert^2.$$

If $X$ has full column rank, then the OLS estimator is unique and satisfies the normal equations

$$X^\top X \hat\beta = X^\top y.$$

Hence,

$$\hat\beta = (X^\top X)^{-1} X^\top y.$$

Let $\hat y = X\hat\beta$ denote the fitted values and let $e = y - \hat y$ denote the residual vector.

Important properties:

  1. The fitted values $\hat y$ are the projection of $y$ onto the column space of $X$.
  2. The residual vector $e$ is orthogonal to every column of $X$, so $X^\top e = 0$.
  3. If the model is correctly specified and $\mathbb{E}[\varepsilon \mid X] = 0$, then $\hat\beta$ is unbiased.
  4. If additionally $\mathrm{Var}(\varepsilon \mid X) = \sigma^2 I$, then $\mathrm{Var}(\hat\beta \mid X) = \sigma^2 (X^\top X)^{-1}$.
  5. Under the same homoscedasticity assumption, OLS is the best linear unbiased estimator (BLUE) by the Gauss-Markov theorem.

Intuition

OLS picks the hyperplane that makes the observed responses as close as possible, in squared distance, to the fitted values. The projection viewpoint explains many algebraic facts automatically: the residual must be orthogonal to the subspace onto which we projected.
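
The projection view is easy to check numerically. Below is a minimal NumPy sketch on simulated data with made-up true coefficients: it fits OLS (via `np.linalg.lstsq`, which solves the same least squares problem) and verifies that the residual vector is orthogonal to the columns of the design matrix.

```python
# A minimal NumPy sketch of the OLS fit described above, using simulated data
# (the true coefficients and noise level below are illustrative assumptions).
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X_raw = rng.normal(size=(n, p))                 # predictors
beta_true = np.array([2.0, 0.5, -1.0, 3.0])     # [intercept, beta_1, beta_2, beta_3]
X = np.column_stack([np.ones(n), X_raw])        # design matrix with intercept column
y = X @ beta_true + rng.normal(scale=1.0, size=n)

# Solve the least squares problem (lstsq is more stable than an explicit inverse).
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

y_hat = X @ beta_hat          # fitted values: projection of y onto the column space of X
resid = y - y_hat             # residual vector

# The residuals are orthogonal to every column of X, so X^T e should be ~0.
print("beta_hat:", np.round(beta_hat, 3))
print("X^T e   :", np.round(X.T @ resid, 10))
```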

Two common goodness-of-fit summaries are

$$\mathrm{RSS} = \sum_{i=1}^n (y_i - \hat y_i)^2$$

and

$$\mathrm{TSS} = \sum_{i=1}^n (y_i - \bar y)^2,$$

where $\bar y = \frac{1}{n}\sum_{i=1}^n y_i$.

The quantity $\mathrm{TSS}$ measures the total variability in the response around its sample mean, while $\mathrm{RSS}$ measures the variability left unexplained after fitting the regression model. Hence,

$$R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$

measures the fraction of the variability in $y$ that is explained by the fitted linear model.

Interpretation

In the usual regression setting with an intercept, $R^2$ lies in $[0, 1]$.

  1. $R^2 = 0$ means the model does no better than predicting the sample mean for every observation.
  2. $R^2 = 1$ means a perfect fit on the training data.
  3. Larger values are preferred, since they indicate that less variation is left unexplained by the model.

Adjusted $R^2$ plays a similar role, but it penalizes adding predictors that do not improve the fit enough to justify the additional complexity. Because of this penalty, adjusted $R^2$ can decrease when a new predictor is added.

Warning

Adding predictors can only increase $R^2$, so a larger $R^2$ does not necessarily mean a better model. Adjusted $R^2$, out-of-sample error, or explicit model comparison procedures are usually more informative.

This happens because a larger model can always reproduce a smaller one by setting the extra coefficients to zero. Thus, when OLS minimizes RSS over the larger model class, the new minimum RSS cannot be worse than the old one, and since $R^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$, $R^2$ can only increase or stay the same.
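
One way to see this concretely is to append a pure-noise predictor and recompute both summaries. The sketch below uses simulated data and a hypothetical junk column; $R^2$ cannot drop, while adjusted $R^2$ often does.

```python
# Sketch: R^2 never decreases when a predictor is added, but adjusted R^2 can.
# The data and the extra "junk" column are simulated for illustration.
import numpy as np

def fit_r2(X, y):
    """Fit OLS (X already includes an intercept column) and return (R^2, adjusted R^2)."""
    n, k = X.shape                              # k = number of columns incl. intercept
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    rss = resid @ resid
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1 - rss / tss
    adj_r2 = 1 - (rss / (n - k)) / (tss / (n - 1))
    return r2, adj_r2

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_large = np.column_stack([X_small, rng.normal(size=n)])  # add a pure-noise predictor

print("small model (R^2, adj R^2):", fit_r2(X_small, y))
print("large model (R^2, adj R^2):", fit_r2(X_large, y))  # R^2 >= small; adj R^2 may drop
```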

Estimation Methods

Least Squares

The most common way to estimate the coefficients is OLS, which minimizes

$$\mathrm{RSS}(\beta) = \lVert y - X\beta \rVert^2.$$

This is closely related to the training version of MSE. If $\mathrm{MSE}(\beta) = \frac{1}{n}\lVert y - X\beta \rVert^2$, then

$$\mathrm{MSE}(\beta) = \frac{1}{n}\,\mathrm{RSS}(\beta).$$

Since $\frac{1}{n}$ is just a constant, minimizing RSS and minimizing training MSE lead to the same solution.

When $X^\top X$ is invertible, the solution is

$$\hat\beta = (X^\top X)^{-1} X^\top y.$$

An unbiased estimator of the error variance is

$$\hat\sigma^2 = \frac{\mathrm{RSS}}{n - p - 1}.$$

This denominator reflects the loss of one degree of freedom for the intercept and one for each predictor coefficient.
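
As a small self-contained sketch (simulated data, with an assumed true error standard deviation of 2), the estimate is just the RSS divided by the residual degrees of freedom:

```python
# Sketch: the error variance estimate RSS / (n - p - 1), on simulated data.
import numpy as np

rng = np.random.default_rng(6)
n, p = 120, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])          # intercept + p predictors
y = X @ np.array([0.5, 1.0, -1.0]) + rng.normal(scale=2.0, size=n)  # true sigma = 2

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p - 1)          # divide by residual degrees of freedom
print("sigma^2 hat:", round(sigma2_hat, 3), "(true value 4.0)")
```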

Maximum Likelihood Estimation

If we strengthen the model and assume

$$\varepsilon \mid X \sim \mathcal{N}(0, \sigma^2 I),$$

then

$$y \mid X \sim \mathcal{N}(X\beta, \sigma^2 I).$$

In this case we can estimate the parameters by maximum likelihood. The log-likelihood is, up to an additive constant,

$$\ell(\beta, \sigma^2) = -\frac{n}{2}\log \sigma^2 - \frac{1}{2\sigma^2}\lVert y - X\beta \rVert^2.$$

For fixed $\sigma^2$, maximizing the log-likelihood over $\beta$ is equivalent to minimizing RSS, so

$$\hat\beta_{\mathrm{MLE}} = \hat\beta_{\mathrm{OLS}}.$$

Thus, OLS and MLE give the same coefficient estimates under Gaussian errors.

The MLE of the variance is

$$\hat\sigma^2_{\mathrm{MLE}} = \frac{\mathrm{RSS}}{n},$$

which differs from the usual unbiased estimator $\hat\sigma^2 = \mathrm{RSS}/(n - p - 1)$.

Interpretation

Least squares does not require Normal errors to define the estimator. Normality only matters when we want the likelihood formulation and exact finite-sample inference based on $t$ and $F$ distributions.

When the Gaussian model is appropriate, standard inference for coefficients uses

$$\hat\beta_j \pm t_{n - p - 1,\, 1 - \alpha/2}\,\widehat{\mathrm{se}}(\hat\beta_j), \qquad \widehat{\mathrm{se}}(\hat\beta_j) = \sqrt{\hat\sigma^2\,[(X^\top X)^{-1}]_{jj}},$$

for confidence intervals, and $t$-tests or $F$-tests for hypotheses about one or more coefficients.
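
A minimal sketch of these intervals on simulated data, assuming a 95% confidence level and using only NumPy and `scipy.stats.t`:

```python
# Sketch: classical 95% confidence intervals for OLS coefficients under Gaussian errors.
# The data below are simulated; in practice X and y come from your dataset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, p = 80, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # intercept + p predictors
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
resid = y - X @ beta_hat
df = n - p - 1
sigma2_hat = resid @ resid / df
se = np.sqrt(sigma2_hat * np.diag(XtX_inv))                   # standard errors of beta_hat

t_crit = stats.t.ppf(0.975, df)                               # two-sided 95% critical value
lower, upper = beta_hat - t_crit * se, beta_hat + t_crit * se
for j in range(p + 1):
    print(f"beta_{j}: {beta_hat[j]: .3f}  95% CI [{lower[j]: .3f}, {upper[j]: .3f}]")
```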

Model Selection

Suppose we have several candidate regression models. The right comparison criterion depends on whether the goal is explanation, hypothesis testing, or prediction.

Common approaches:

  1. Adjusted $R^2$: useful when comparing models with different numbers of predictors. Unlike $R^2$, it penalizes unnecessary complexity.
  2. AIC / BIC: likelihood-based criteria that trade off fit and complexity. Smaller values are preferred. BIC penalizes complexity more heavily than AIC.
  3. Nested-model $F$-tests: useful when one model is a special case of another. This tests whether the extra predictors in the larger model provide a meaningful reduction in RSS.
  4. Cross-validation: usually the best choice when the main goal is prediction. Compare models using validation or cross-validation error rather than training error (see the sketch after this list).
  5. Test-set metrics: if a clean holdout set is available, compare models using out-of-sample MSE, MAE, or related prediction metrics.
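
As a concrete illustration of the cross-validation comparison above, here is a short sketch using scikit-learn's `LinearRegression` and `cross_val_score` on simulated data; the two candidate feature sets are made up for illustration.

```python
# Sketch: comparing two candidate linear models by K-fold cross-validated MSE.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 200
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)        # x2 is irrelevant by construction

candidates = {
    "x1 only": np.column_stack([x1]),
    "x1 + x2": np.column_stack([x1, x2]),
}
for name, X in candidates.items():
    scores = cross_val_score(LinearRegression(), X, y,
                             scoring="neg_mean_squared_error", cv=5)
    print(f"{name}: CV MSE = {-scores.mean():.3f}")
```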

For nested models

$$M_0 \subset M_1,$$

with $q$ additional predictors in the larger model, the usual $F$ statistic is

$$F = \frac{(\mathrm{RSS}_0 - \mathrm{RSS}_1)/q}{\mathrm{RSS}_1/(n - p_1 - 1)},$$

where $p_1$ is the number of non-intercept predictors in the larger model. Under the null hypothesis that the extra coefficients are zero, $F$ follows an $F_{q,\, n - p_1 - 1}$ distribution.
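
A small sketch of this test on simulated data, with $q = 2$ extra predictors that are irrelevant by construction (so the test should usually fail to reject):

```python
# Sketch of the nested-model F test above, with simulated data and q = 2 extra predictors.
import numpy as np
from scipy import stats

def rss(X, y):
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    return resid @ resid

rng = np.random.default_rng(4)
n = 150
Z = rng.normal(size=(n, 3))                       # columns: x1, x2, x3
y = 0.5 + 1.5 * Z[:, 0] + rng.normal(size=n)      # only x1 matters by construction

X0 = np.column_stack([np.ones(n), Z[:, :1]])      # smaller model: intercept + x1
X1 = np.column_stack([np.ones(n), Z])             # larger model: intercept + x1, x2, x3

q = X1.shape[1] - X0.shape[1]                     # number of extra predictors
df1 = n - X1.shape[1]                             # residual df of larger model: n - p1 - 1
F = ((rss(X0, y) - rss(X1, y)) / q) / (rss(X1, y) / df1)
p_value = stats.f.sf(F, q, df1)                   # upper-tail probability under H0
print(f"F = {F:.3f}, p-value = {p_value:.3f}")
```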

Rule of Thumb

If your goal is interpretation, prefer a smaller model that is stable and scientifically defensible. If your goal is prediction, prefer the model with the best out-of-sample performance.

Warning

Stepwise procedures can be useful for exploration, but they can also produce unstable models and overly optimistic significance statements. It is better to combine automated selection with domain knowledge and validation.

Model Assumptions

The assumptions below are the main ones that come up in interviews and in practice. Some are required for unbiased estimation, while others are mainly required for reliable standard errors and hypothesis tests.

| Assumption | Meaning | How to check | If violated |
| --- | --- | --- | --- |
| Linearity | The conditional mean is linear in the predictors: $\mathbb{E}[y \mid x] = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$. | Residuals vs fitted plots, residuals vs each predictor, domain knowledge. | Coefficients can be systematically misleading and predictions can be biased. Add transformations, interactions, or a different model class. |
| Exogeneity / zero conditional mean | $\mathbb{E}[\varepsilon \mid X] = 0$. Equivalently, predictors are uncorrelated with the error. | This is mostly a design assumption; think about omitted variables, simultaneity, and measurement error. Residual plots alone cannot prove it. | OLS is biased and inconsistent. This is usually the most serious violation for causal or explanatory use. |
| No perfect multicollinearity | No predictor is an exact linear combination of the others, so $X^\top X$ is invertible. | Check correlations, dummy-variable traps, rank deficiency, VIF, or condition number. | Coefficients may be unidentifiable or numerically unstable; standard errors can become large. |
| Homoscedasticity | $\mathrm{Var}(\varepsilon_i \mid X) = \sigma^2$ for all $i$. | Residuals vs fitted plots, scale-location plots, Breusch-Pagan or White tests. | OLS coefficients remain unbiased under exogeneity, but the usual standard errors are wrong. Use robust standard errors or a better variance model. |
| Independence | Errors are independent across observations. | Think about the sampling process; for ordered data use residual autocorrelation plots or Durbin-Watson-type diagnostics. | Standard errors are often wrong, and efficiency drops. Use clustered or time-series methods when dependence is present. |
| Normality of errors | Errors are Gaussian conditional on $X$. | Q-Q plots, histograms of residuals, normality tests in large enough samples. | OLS is still well defined without Normality, but exact small-sample $t$ and $F$ inference is no longer justified. Large-sample approximations are often still fine. |

Some additional practical diagnostics are worth remembering:

  1. High leverage points are observations with unusual predictor values.
  2. Influential points are observations that substantially change the fitted model; Cook’s distance is a common diagnostic.
  3. Outliers in the response can distort the fit because OLS squares residuals and therefore weights large deviations heavily.
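
A minimal NumPy sketch of the leverage and Cook's distance diagnostics mentioned above, on simulated data with one deliberately extreme predictor value:

```python
# Sketch: leverage (hat values) and Cook's distance, computed directly from the OLS fit.
# Data are simulated; one observation is made extreme on purpose to stand out.
import numpy as np

rng = np.random.default_rng(5)
n = 60
x = rng.normal(size=n)
x[0] = 6.0                                        # artificially high-leverage point
y = 1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
k = X.shape[1]                                    # number of coefficients (p + 1)

H = X @ np.linalg.inv(X.T @ X) @ X.T              # hat matrix: y_hat = H y
h = np.diag(H)                                    # leverages
resid = y - H @ y
sigma2_hat = resid @ resid / (n - k)

# Cook's distance: influence of each observation on the fitted coefficients.
cooks_d = (resid ** 2 / (k * sigma2_hat)) * (h / (1 - h) ** 2)
print("max leverage at obs", h.argmax(), "with h =", round(h.max(), 3))
print("max Cook's distance at obs", cooks_d.argmax(), "=", round(cooks_d.max(), 3))
```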

Interview Summary

If the interviewer asks for the assumptions of linear regression, the safest compact answer is: linearity, zero-mean errors conditional on the predictors, no perfect multicollinearity, constant variance, independence, and Normality if you want exact classical inference.

Warning

A common mistake is to say that Normality is required for OLS to work at all. It is not. Normality is mainly used for likelihood-based modeling and exact finite-sample inference, not for defining the least squares estimator.