Logistic Regression
Logistic regression is one of the standard models for supervised learning when the response is binary. The goal is to model the conditional probability of a class label as a function of several predictors, while ensuring that the predicted probabilities always lie in $(0, 1)$.
Suppose we observe pairs $(x_i, y_i)$ for $i = 1, \dots, n$, where $x_i \in \mathbb{R}^p$ is a vector of predictors and $y_i \in \{0, 1\}$ is the response. In binary logistic regression, we write

$$P(Y_i = 1 \mid X_i = x_i) = \sigma(\beta_0 + x_i^\top \beta),$$

where

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

is the logistic function, $\beta_0$ is the intercept, and $\beta = (\beta_1, \dots, \beta_p)^\top$ is the coefficient vector.
Equivalently,

$$\log \frac{P(Y_i = 1 \mid X_i = x_i)}{1 - P(Y_i = 1 \mid X_i = x_i)} = \beta_0 + x_i^\top \beta.$$
Definition
The conditional mean under the logistic regression model is

$$E[Y_i \mid X_i = x_i] = P(Y_i = 1 \mid X_i = x_i) = \sigma(\beta_0 + x_i^\top \beta).$$

Since $Y_i \in \{0, 1\}$, the conditional mean is the conditional probability of class 1.
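As a concrete illustration, here is a minimal NumPy sketch of the model equation above; the intercept and coefficient values are made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, beta0, beta):
    """P(Y = 1 | X = x) = sigmoid(beta0 + x @ beta) for each row of X."""
    return sigmoid(beta0 + X @ beta)

# Hypothetical coefficients and two observations with p = 2 predictors.
beta0, beta = -0.5, np.array([1.2, -0.7])
X = np.array([[0.3, 1.5], [2.0, 0.1]])
print(predict_proba(X, beta0, beta))  # probabilities strictly between 0 and 1
```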
Interpretation
The coefficients are linear on the log-odds scale, not on the probability scale. A one-unit increase in predictor $x_j$ changes the log-odds by $\beta_j$, holding the other predictors fixed.
Log-Odds
The idea of log-odds is easiest to understand in two steps.
First, if an event happens with probability $p$, then its odds are

$$\text{odds} = \frac{p}{1 - p}.$$
This compares the probability that the event happens to the probability that it does not happen.
For example:
- If $p = 0.5$, then the odds are $0.5 / 0.5 = 1$.
- If $p = 0.75$, then the odds are $0.75 / 0.25 = 3$.
- If $p = 0.25$, then the odds are $0.25 / 0.75 = 1/3$.
Second, the log-odds is just the logarithm of the odds:

$$\text{log-odds}(p) = \log \frac{p}{1 - p}.$$
This transformation is useful because probabilities are constrained to lie in $(0, 1)$, while log-odds can take any real value in $(-\infty, \infty)$. That makes it natural to model the log-odds as a linear function of the predictors.
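A short sketch of the two steps; the probabilities here are arbitrary illustrative values.

```python
import numpy as np

p = np.array([0.1, 0.25, 0.5, 0.75, 0.9])  # arbitrary example probabilities
odds = p / (1 - p)                          # step 1: odds
log_odds = np.log(odds)                     # step 2: log-odds (logit)

for pi, oi, li in zip(p, odds, log_odds):
    print(f"p = {pi:.2f}  odds = {oi:.3f}  log-odds = {li:+.3f}")
# Note the symmetry: logit(0.25) = -logit(0.75), and logit(0.5) = 0.
```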
Intuition
Probabilities are bounded, so modeling them directly with a linear function is awkward. The log-odds transforms a probability into something unbounded, and logistic regression assumes that this transformed quantity changes linearly with the predictors.
Some useful reference points are:
- If $p = 0.5$, then the log-odds is $0$.
- If $p > 0.5$, then the log-odds is positive.
- If $p < 0.5$, then the log-odds is negative.
- As $p \to 1$, the log-odds goes to $+\infty$.
- As $p \to 0$, the log-odds goes to $-\infty$.
So when logistic regression writes

$$\log \frac{P(Y = 1 \mid X = x)}{1 - P(Y = 1 \mid X = x)} = \beta_0 + x^\top \beta,$$
it is saying that the predictors act linearly on the log-odds of class 1, not directly on the probability itself.
Properties
Some of the most useful properties of logistic regression come from its probability interpretation.
Important properties:
- The model predicts probabilities in $(0, 1)$, unlike ordinary linear regression applied to binary data.
- The decision boundary is linear in the predictors: predicting class 1 when $P(Y = 1 \mid X = x) > 0.5$ is equivalent to checking whether $\beta_0 + x^\top \beta > 0$ (see the sketch after this list).
- The coefficients have a simple odds-ratio interpretation: $e^{\beta_j}$ is the multiplicative change in the odds for a one-unit increase in $x_j$, holding the other predictors fixed.
- Logistic regression is a generalized linear model with Bernoulli response, logit link, and linear predictor $\beta_0 + x^\top \beta$.
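A small numeric check of the second and third properties, with made-up coefficients; $0.5$ is the usual probability threshold.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta = -1.0, np.array([2.0, -0.5])   # hypothetical coefficients
x = np.array([0.8, 0.3])

score = beta0 + x @ beta
p = sigmoid(score)
# Property: p > 0.5 exactly when the linear score is > 0.
print(p > 0.5, score > 0)

# Odds-ratio interpretation: raising x[0] by one unit multiplies the odds by exp(beta[0]).
p_plus = sigmoid(beta0 + (x + np.array([1.0, 0.0])) @ beta)
odds_ratio = (p_plus / (1 - p_plus)) / (p / (1 - p))
print(np.isclose(odds_ratio, np.exp(beta[0])))  # True
```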
Intuition
Logistic regression starts with a linear score $z = \beta_0 + x^\top \beta$, then passes it through the logistic function so that the output becomes a valid probability. Very negative scores map near 0, very positive scores map near 1, and a score of 0 maps to probability $0.5$.
Unlike linear regression, there is no single universally used goodness-of-fit summary analogous to $R^2$. Common summaries include the log-likelihood, classification accuracy, precision/recall, ROC-AUC, and pseudo-$R^2$ measures.
Warning
Accuracy alone can be misleading, especially for imbalanced data. For example, if 95% of observations belong to class 0, a classifier that always predicts 0 has 95% accuracy but may be useless.
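To make the warning concrete, here is a toy sketch; the 95/5 class split mirrors the example above, and the data are otherwise made up.

```python
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.05).astype(int)   # ~5% positives, ~95% negatives
y_pred = np.zeros_like(y)                   # classifier that always predicts class 0

accuracy = (y_pred == y).mean()
recall = y_pred[y == 1].mean() if (y == 1).any() else 0.0
print(f"accuracy = {accuracy:.3f}, recall for class 1 = {recall:.3f}")
# Accuracy is roughly 0.95, but the classifier never detects a single positive case.
```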
Estimation Methods
Maximum Likelihood Estimation
The standard way to estimate the coefficients is by maximum likelihood. If the observations are conditionally independent given the predictors, then the likelihood is

$$L(\beta_0, \beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}, \qquad p_i = \sigma(\beta_0 + x_i^\top \beta).$$

The log-likelihood is

$$\ell(\beta_0, \beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right].$$

The MLE $(\hat\beta_0, \hat\beta)$ is the value of $(\beta_0, \beta)$ that maximizes $\ell(\beta_0, \beta)$.
Unlike ordinary least squares in linear regression, logistic regression does not usually have a closed-form solution. Instead, the coefficients are computed numerically, often by Newton-Raphson, iteratively reweighted least squares (IRLS), or gradient-based optimization.
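Here is a minimal sketch of one way to fit the model numerically: a plain Newton-Raphson / IRLS loop written directly from the log-likelihood above (no regularization, no convergence safeguards; a hypothetical simulated dataset stands in for real data).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iter=25):
    """Newton-Raphson / IRLS for logistic regression with an intercept column."""
    Xd = np.column_stack([np.ones(len(y)), X])   # prepend intercept
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xd @ beta)
        W = p * (1 - p)                          # Bernoulli variances (IRLS weights)
        grad = Xd.T @ (y - p)                    # gradient of the log-likelihood
        hess = Xd.T @ (Xd * W[:, None])          # observed information X^T W X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Hypothetical data generated from known coefficients, just to sanity-check the fit.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
true_beta = np.array([0.5, 1.5, -1.0])           # [intercept, b1, b2]
scores = np.column_stack([np.ones(500), X]) @ true_beta
y = (rng.random(500) < sigmoid(scores)).astype(int)
print(fit_logistic_newton(X, y))                 # should land close to true_beta
```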
Warning
Perfect class separation can be a problem for unregularized logistic regression. If some linear boundary separates the two classes exactly, the likelihood can keep increasing as the corresponding coefficients grow in magnitude, so the usual finite MLE may not exist even though the classification problem looks easy.
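A tiny illustration of the separation problem, using a one-dimensional perfectly separated toy dataset: as the slope grows, the log-likelihood keeps increasing toward 0, so there is no finite maximizer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Perfectly separated 1-D data: every negative x is class 0, every positive x is class 1.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

def log_likelihood(b):
    p = sigmoid(b * x)          # model with slope b and no intercept
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

for b in [1, 2, 4, 8, 16]:
    print(b, log_likelihood(b))  # increases toward 0 as b grows without bound
```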
Cross-Entropy / Log Loss
Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood, also called the cross-entropy loss or log loss:

$$\text{LogLoss}(\beta_0, \beta) = -\sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right].$$

If we divide by $n$, we get the average log loss,

$$\frac{1}{n} \text{LogLoss}(\beta_0, \beta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right],$$

and this leads to the same solution because dividing by $n$ only rescales the objective by a constant.
If we write

$$z_i = \beta_0 + x_i^\top \beta$$

and

$$p_i = \sigma(z_i),$$

then differentiating the negative log-likelihood gives

$$\frac{\partial \, \text{LogLoss}}{\partial \beta_j} = \sum_{i=1}^{n} (p_i - y_i)\, x_{ij}.$$

In matrix notation,

$$\nabla_\beta \, \text{LogLoss} = X^\top (p - y),$$

where $p = (p_1, \dots, p_n)^\top$ is the vector of predicted probabilities.
For the average log loss, the gradient is just

$$\frac{1}{n} X^\top (p - y).$$
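A minimal sketch that evaluates the average log loss and its gradient $\tfrac{1}{n} X^\top (p - y)$, and checks one coordinate against a finite difference; the data and coefficients are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def avg_log_loss(beta, X, y):
    p = sigmoid(X @ beta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def avg_log_loss_grad(beta, X, y):
    p = sigmoid(X @ beta)
    return X.T @ (p - y) / len(y)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))            # hypothetical design matrix (no intercept here)
y = rng.integers(0, 2, size=200)
beta = rng.normal(size=3)

# Finite-difference check of the first coordinate of the gradient.
eps = 1e-6
e0 = np.zeros(3)
e0[0] = eps
numeric = (avg_log_loss(beta + e0, X, y) - avg_log_loss(beta - e0, X, y)) / (2 * eps)
print(numeric, avg_log_loss_grad(beta, X, y)[0])   # should agree closely
```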
Intuition
The term $p_i - y_i$ is the prediction error in probability space. If the model predicts a probability that is too large relative to the true label, the gradient pushes the parameters in the direction that decreases that probability, and vice versa.
Interpretation
Log loss strongly penalizes confident wrong predictions. Predicting a probability near 1 for an observation whose true label is 0 produces a large loss.
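A quick numeric illustration of how fast log loss grows for a confident wrong prediction (true label 0, predicted probabilities chosen for the example):

```python
import numpy as np

# Per-observation log loss when the true label is 0: -log(1 - p).
for p in [0.5, 0.9, 0.99, 0.999]:
    print(f"predicted p = {p:<6}  loss = {-np.log(1 - p):.2f}")
# 0.69, 2.30, 4.61, 6.91 — the loss blows up as the wrong prediction gets more confident.
```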
Inference
Under regularity conditions, the MLE is asymptotically Normal, so coefficient inference is usually based on Wald intervals/tests or likelihood ratio tests.
For a single coefficient $\beta_j$, a common large-sample confidence interval is

$$\hat\beta_j \pm z_{1 - \alpha/2} \, \widehat{\mathrm{se}}(\hat\beta_j).$$
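A small sketch of the Wald interval itself; the coefficient estimate and standard error here are hypothetical placeholders (in practice they come from the fitted model, e.g. the inverse observed information).

```python
from scipy.stats import norm

def wald_ci(beta_hat, se, alpha=0.05):
    """Large-sample Wald interval: beta_hat +/- z_{1 - alpha/2} * se."""
    z = norm.ppf(1 - alpha / 2)
    return beta_hat - z * se, beta_hat + z * se

# Hypothetical estimate and standard error for one coefficient.
lo, hi = wald_ci(beta_hat=0.84, se=0.31)
print(f"95% CI: ({lo:.3f}, {hi:.3f})")
# The corresponding odds-ratio interval is (exp(lo), exp(hi)).
```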
Model Selection
Suppose we have several candidate logistic regression models. The right comparison criterion depends on whether the goal is explanation, classification performance, or calibrated probability estimation.
Common approaches:
- AIC / BIC: natural choices for logistic regression because the model is likelihood-based.
- Cross-validation: usually preferred when the main goal is predictive performance.
- Log loss: useful when we care about the quality of predicted probabilities, not just the final class labels.
- ROC-AUC / PR-AUC: useful when ranking quality matters, especially with class imbalance.
- Likelihood ratio tests: useful for comparing nested logistic regression models.
For nested models, the likelihood ratio statistic is

$$\Lambda = 2\left[\ell(\hat\theta_{\text{large}}) - \ell(\hat\theta_{\text{small}})\right],$$

where $\hat\theta_{\text{small}}$ and $\hat\theta_{\text{large}}$ are the MLEs under the smaller and larger models respectively. Under standard regularity conditions, this statistic is approximately Chi-squared under the null, with degrees of freedom equal to the difference in the number of parameters.
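A minimal sketch of the test given the two maximized log-likelihoods; the numbers and parameter counts are hypothetical.

```python
from scipy.stats import chi2

# Hypothetical maximized log-likelihoods and parameter counts for nested models.
loglik_small, k_small = -412.7, 3
loglik_large, k_large = -405.2, 5

lr_stat = 2 * (loglik_large - loglik_small)
df = k_large - k_small
p_value = chi2.sf(lr_stat, df)          # survival function of the chi-squared distribution
print(f"LR statistic = {lr_stat:.2f}, df = {df}, p-value = {p_value:.4f}")
```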
Rule of Thumb
If you want well-calibrated probabilities, compare models using log loss or related likelihood-based criteria. If you care mostly about classification quality, use cross-validation with metrics that match the application.
Warning
A low training error does not guarantee good probabilities or good generalization. In logistic regression, it is common to separately think about discrimination, calibration, and out-of-sample performance.
Model Assumptions
The assumptions below are the main ones that come up in interviews and in practice.
| Assumption | Meaning | How to check | If violated |
|---|---|---|---|
| Binary response | The outcome takes values in $\{0, 1\}$ for standard binary logistic regression. | Check how the response is encoded. | The model is not appropriate as stated; use multinomial, ordinal, or another model if needed. |
| Independent observations | Conditional on the predictors, the observations are independent. | Think about the sampling process; repeated measures or clustered data are warning signs. | Standard errors and tests can be wrong. Consider clustered or mixed-effects methods. |
| Correct mean specification | The log-odds are linear in the predictors: $\log \frac{P(Y = 1 \mid X = x)}{1 - P(Y = 1 \mid X = x)} = \beta_0 + x^\top \beta$. | Residual plots, calibration plots, domain knowledge, and checking transformations/interactions. | Predicted probabilities can be biased and coefficient interpretations can be misleading. |
| No perfect multicollinearity | No predictor is an exact linear combination of the others. | Check correlations, rank deficiency, VIF, or condition number. | Coefficients may be unidentifiable or numerically unstable. |
| No complete separation | The predictors do not perfectly separate the two classes. | Watch for fitted probabilities collapsing to 0 or 1 and optimization warnings. | The MLE may not exist or coefficients may diverge to infinity. Regularization or a different model may be needed. |
| Large enough sample for asymptotic inference | Wald tests and asymptotic Normal approximations rely on enough information in the data. | Check sample size, especially the number of events in the rare class. | Standard large-sample inference can be unstable or misleading. |
Some additional practical diagnostics are worth remembering:
- Calibration asks whether predicted probabilities match observed frequencies (see the sketch after this list).
- Discrimination asks whether the model ranks positive cases above negative ones.
- Influential observations can still have a large effect on the fitted coefficients and predicted probabilities.
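A rough calibration-check sketch under simple assumptions: bin observations by predicted probability and compare the average prediction with the observed event rate in each bin. Here the predictions and labels are simulated so that the model is well calibrated by construction.

```python
import numpy as np

rng = np.random.default_rng(3)
p_hat = rng.uniform(0.05, 0.95, size=5000)      # simulated predicted probabilities
y = (rng.random(5000) < p_hat).astype(int)      # labels drawn from those probabilities

bins = np.linspace(0, 1, 11)                    # ten equal-width probability bins
which = np.digitize(p_hat, bins) - 1
for b in range(10):
    mask = which == b
    if mask.any():
        print(f"bin {bins[b]:.1f}-{bins[b+1]:.1f}: "
              f"mean predicted = {p_hat[mask].mean():.3f}, observed rate = {y[mask].mean():.3f}")
# For a well-calibrated model the two columns track each other closely in every bin.
```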
Interview Summary
If the interviewer asks for the assumptions of logistic regression, the safest compact answer is: binary outcome, independent observations, linearity on the log-odds scale, no perfect multicollinearity, no complete separation, and enough data for stable likelihood-based inference.
Warning
A common mistake is to say that logistic regression assumes the response itself is linearly related to the predictors. It does not. The linearity assumption is on the log-odds, not directly on the probability.