Logistic Regression
Logistic regression is one of the standard models for supervised learning when the response is binary. The goal is to model the conditional probability of a class label as a function of several predictors, while ensuring that the predicted probabilities always lie in $(0, 1)$.
Suppose we observe pairs $(x_i, y_i)$ for $i = 1, \dots, n$, where $x_i \in \mathbb{R}^p$ is a vector of predictors and $y_i \in \{0, 1\}$ is the response. In binary logistic regression, we write

$$P(Y_i = 1 \mid X_i = x_i) = \sigma(\beta_0 + x_i^\top \beta),$$

where

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$

is the logistic function, $\beta_0$ is the intercept, and $\beta = (\beta_1, \dots, \beta_p)^\top$ is the coefficient vector.
Equivalently,

$$\log \frac{P(Y_i = 1 \mid X_i = x_i)}{1 - P(Y_i = 1 \mid X_i = x_i)} = \beta_0 + x_i^\top \beta.$$
Definition
The conditional mean under the logistic regression model is

$$E[Y_i \mid X_i = x_i] = P(Y_i = 1 \mid X_i = x_i) = \sigma(\beta_0 + x_i^\top \beta).$$

Since $Y_i \in \{0, 1\}$, the conditional mean is the conditional probability of class 1.
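As a concrete illustration, here is a minimal NumPy sketch of the model equation above; the intercept and coefficient values are made up for the example.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, beta0, beta):
    """P(Y = 1 | X = x) = sigmoid(beta0 + x @ beta) for each row of X."""
    return sigmoid(beta0 + X @ beta)

# Hypothetical coefficients and two observations with p = 2 predictors.
beta0, beta = -0.5, np.array([1.2, -0.7])
X = np.array([[0.3, 1.5], [2.0, 0.1]])
print(predict_proba(X, beta0, beta))  # probabilities strictly between 0 and 1
```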
Interpretation
The coefficients are linear on the log-odds scale, not on the probability scale. A one-unit increase in predictor $x_j$ changes the log-odds by $\beta_j$, holding the other predictors fixed.
Log-Odds
The idea of log-odds is easiest to understand in two steps.
First, if an event happens with probability $p$, then its odds are

$$\text{odds} = \frac{p}{1 - p}.$$
This compares the probability that the event happens to the probability that it does not happen.
For example:
- If $p = 0.5$, then the odds are $0.5 / 0.5 = 1$.
- If $p = 0.75$, then the odds are $0.75 / 0.25 = 3$.
- If $p = 0.25$, then the odds are $0.25 / 0.75 = 1/3$.
Second, the log-odds is just the logarithm of the odds:

$$\text{log-odds}(p) = \log \frac{p}{1 - p}.$$
This transformation is useful because probabilities are constrained to lie in $(0, 1)$, while log-odds can take any real value in $(-\infty, \infty)$. That makes it natural to model the log-odds as a linear function of the predictors.
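A short sketch of the two steps; the probabilities here are arbitrary illustrative values.

```python
import numpy as np

p = np.array([0.1, 0.25, 0.5, 0.75, 0.9])  # arbitrary example probabilities
odds = p / (1 - p)                          # step 1: odds
log_odds = np.log(odds)                     # step 2: log-odds (logit)

for pi, oi, li in zip(p, odds, log_odds):
    print(f"p = {pi:.2f}  odds = {oi:.3f}  log-odds = {li:+.3f}")
# Note the symmetry: logit(0.25) = -logit(0.75), and logit(0.5) = 0.
```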
Intuition
Probabilities are bounded, so modeling them directly with a linear function is awkward. The log-odds transforms a probability into something unbounded, and logistic regression assumes that this transformed quantity changes linearly with the predictors.
Some useful reference points are:
- If $p = 0.5$, then the log-odds is $0$.
- If $p > 0.5$, then the log-odds is positive.
- If $p < 0.5$, then the log-odds is negative.
- As $p \to 1$, the log-odds goes to $+\infty$.
- As $p \to 0$, the log-odds goes to $-\infty$.
So when logistic regression writes

$$\log \frac{P(Y = 1 \mid X = x)}{1 - P(Y = 1 \mid X = x)} = \beta_0 + x^\top \beta,$$
it is saying that the predictors act linearly on the log-odds of class 1, not directly on the probability itself.
Properties
Some of the most useful properties of logistic regression come from its probability interpretation.
Important properties:
- The model predicts probabilities in $(0, 1)$, unlike ordinary linear regression applied to binary data.
- The decision boundary is linear in the predictors: predicting class 1 when $P(Y = 1 \mid X = x) > 0.5$ is equivalent to checking whether $\beta_0 + x^\top \beta > 0$ (see the sketch after this list).
- The coefficients have a simple odds-ratio interpretation: $e^{\beta_j}$ is the multiplicative change in the odds for a one-unit increase in $x_j$, holding the other predictors fixed.
- Logistic regression is a generalized linear model with Bernoulli response, logit link, and linear predictor $\beta_0 + x^\top \beta$.
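A small numeric check of the second and third properties, with made-up coefficients; $0.5$ is the usual probability threshold.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

beta0, beta = -1.0, np.array([2.0, -0.5])   # hypothetical coefficients
x = np.array([0.8, 0.3])

score = beta0 + x @ beta
p = sigmoid(score)
# Property: p > 0.5 exactly when the linear score is > 0.
print(p > 0.5, score > 0)

# Odds-ratio interpretation: raising x[0] by one unit multiplies the odds by exp(beta[0]).
p_plus = sigmoid(beta0 + (x + np.array([1.0, 0.0])) @ beta)
odds_ratio = (p_plus / (1 - p_plus)) / (p / (1 - p))
print(np.isclose(odds_ratio, np.exp(beta[0])))  # True
```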
Intuition
Logistic regression starts with a linear score $z = \beta_0 + x^\top \beta$, then passes it through the logistic function so that the output becomes a valid probability. Very negative scores map near 0, very positive scores map near 1, and a score of 0 maps to probability $0.5$.
Unlike linear regression, there is no single universally used goodness-of-fit summary analogous to $R^2$. Common summaries include the log-likelihood, classification accuracy, precision/recall, ROC-AUC, and pseudo-$R^2$ measures.
Warning
Accuracy alone can be misleading, especially for imbalanced data. For example, if 95% of observations belong to class 0, a classifier that always predicts 0 has 95% accuracy but may be useless.
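To make the warning concrete, here is a toy sketch; the 95/5 class split mirrors the example above, and the data are otherwise made up.

```python
import numpy as np

rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.05).astype(int)   # ~5% positives, ~95% negatives
y_pred = np.zeros_like(y)                   # classifier that always predicts class 0

accuracy = (y_pred == y).mean()
recall = y_pred[y == 1].mean() if (y == 1).any() else 0.0
print(f"accuracy = {accuracy:.3f}, recall for class 1 = {recall:.3f}")
# Accuracy is roughly 0.95, but the classifier never detects a single positive case.
```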
Estimation Methods
Maximum Likelihood Estimation
The standard way to estimate the coefficients is by maximum likelihood. If the observations are conditionally independent given the predictors, then the likelihood is

$$L(\beta_0, \beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}, \qquad p_i = \sigma(\beta_0 + x_i^\top \beta).$$

The log-likelihood is

$$\ell(\beta_0, \beta) = \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right].$$

The MLE $(\hat\beta_0, \hat\beta)$ is the value of $(\beta_0, \beta)$ that maximizes $\ell(\beta_0, \beta)$.
Unlike ordinary least squares in linear regression, logistic regression does not usually have a closed-form solution. Instead, the coefficients are computed numerically, often by Newton-Raphson, iteratively reweighted least squares (IRLS), or gradient-based optimization.
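Here is a minimal sketch of one way to fit the model numerically: a plain Newton-Raphson / IRLS loop written directly from the log-likelihood above (no regularization, no convergence safeguards; a hypothetical simulated dataset stands in for real data).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic_newton(X, y, n_iter=25):
    """Newton-Raphson / IRLS for logistic regression with an intercept column."""
    Xd = np.column_stack([np.ones(len(y)), X])   # prepend intercept
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xd @ beta)
        W = p * (1 - p)                          # Bernoulli variances (IRLS weights)
        grad = Xd.T @ (y - p)                    # gradient of the log-likelihood
        hess = Xd.T @ (Xd * W[:, None])          # observed information X^T W X
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Hypothetical data generated from known coefficients, just to sanity-check the fit.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
true_beta = np.array([0.5, 1.5, -1.0])           # [intercept, b1, b2]
scores = np.column_stack([np.ones(500), X]) @ true_beta
y = (rng.random(500) < sigmoid(scores)).astype(int)
print(fit_logistic_newton(X, y))                 # should land close to true_beta
```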
Warning
Perfect class separation can be a problem for unregularized logistic regression. If some linear boundary separates the two classes exactly, the likelihood can keep increasing as the corresponding coefficients grow in magnitude, so the usual finite MLE may not exist even though the classification problem looks easy.
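A tiny illustration of the separation problem, using a one-dimensional perfectly separated toy dataset: as the slope grows, the log-likelihood keeps increasing toward 0, so there is no finite maximizer.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Perfectly separated 1-D data: every negative x is class 0, every positive x is class 1.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])

def log_likelihood(b):
    p = sigmoid(b * x)          # model with slope b and no intercept
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

for b in [1, 2, 4, 8, 16]:
    print(b, log_likelihood(b))  # increases toward 0 as b grows without bound
```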
Cross-Entropy / Log Loss
Maximizing the log-likelihood is equivalent to minimizing the negative log-likelihood, also called the cross-entropy loss or log loss:

$$\text{LogLoss}(\beta_0, \beta) = -\sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right].$$

If we divide by $n$, we get the average log loss,

$$\frac{1}{n} \text{LogLoss}(\beta_0, \beta) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right],$$

and this leads to the same solution because dividing by $n$ only rescales the objective by a constant.
If we write

$$z_i = \beta_0 + x_i^\top \beta$$

and

$$p_i = \sigma(z_i),$$

then differentiating the negative log-likelihood gives

$$\frac{\partial \, \text{LogLoss}}{\partial \beta_j} = \sum_{i=1}^{n} (p_i - y_i)\, x_{ij}.$$

In matrix notation,

$$\nabla_\beta \, \text{LogLoss} = X^\top (p - y),$$

where $p = (p_1, \dots, p_n)^\top$ is the vector of predicted probabilities.
For the average log loss, the gradient is just

$$\frac{1}{n} X^\top (p - y).$$
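A minimal sketch that evaluates the average log loss and its gradient $\tfrac{1}{n} X^\top (p - y)$, and checks one coordinate against a finite difference; the data and coefficients are made up.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def avg_log_loss(beta, X, y):
    p = sigmoid(X @ beta)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def avg_log_loss_grad(beta, X, y):
    p = sigmoid(X @ beta)
    return X.T @ (p - y) / len(y)

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 3))            # hypothetical design matrix (no intercept here)
y = rng.integers(0, 2, size=200)
beta = rng.normal(size=3)

# Finite-difference check of the first coordinate of the gradient.
eps = 1e-6
e0 = np.zeros(3)
e0[0] = eps
numeric = (avg_log_loss(beta + e0, X, y) - avg_log_loss(beta - e0, X, y)) / (2 * eps)
print(numeric, avg_log_loss_grad(beta, X, y)[0])   # should agree closely
```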
Intuition
The term $p_i - y_i$ is the prediction error in probability space. If the model predicts a probability that is too large relative to the true label, the gradient pushes the parameters in the direction that decreases that probability, and vice versa.
Interpretation
Log loss strongly penalizes confident wrong predictions. Predicting a probability near 1 for an observation whose true label is 0 produces a large loss.
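A quick numeric illustration of how fast log loss grows for a confident wrong prediction (true label 0, predicted probabilities chosen for the example):

```python
import numpy as np

# Per-observation log loss when the true label is 0: -log(1 - p).
for p in [0.5, 0.9, 0.99, 0.999]:
    print(f"predicted p = {p:<6}  loss = {-np.log(1 - p):.2f}")
# 0.69, 2.30, 4.61, 6.91 — the loss blows up as the wrong prediction gets more confident.
```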
Inference
Under regularity conditions, the MLE is asymptotically Normal, so coefficient inference is usually based on Wald intervals/tests or likelihood ratio tests.
For a single coefficient $\beta_j$, a common large-sample confidence interval is

$$\hat\beta_j \pm z_{1 - \alpha/2} \, \widehat{\mathrm{se}}(\hat\beta_j).$$
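A small sketch of the Wald interval itself; the coefficient estimate and standard error here are hypothetical placeholders (in practice they come from the fitted model, e.g. the inverse observed information).

```python
from scipy.stats import norm

def wald_ci(beta_hat, se, alpha=0.05):
    """Large-sample Wald interval: beta_hat +/- z_{1 - alpha/2} * se."""
    z = norm.ppf(1 - alpha / 2)
    return beta_hat - z * se, beta_hat + z * se

# Hypothetical estimate and standard error for one coefficient.
lo, hi = wald_ci(beta_hat=0.84, se=0.31)
print(f"95% CI: ({lo:.3f}, {hi:.3f})")
# The corresponding odds-ratio interval is (exp(lo), exp(hi)).
```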
Model Selection
Suppose we have several candidate logistic regression models. The right comparison criterion depends on whether the goal is explanation, classification performance, or calibrated probability estimation.
Common approaches:
- AIC / BIC: natural choices for logistic regression because the model is likelihood-based.
- Cross-validation: usually preferred when the main goal is predictive performance.
- Log loss: useful when we care about the quality of predicted probabilities, not just the final class labels.
- ROC-AUC / PR-AUC: useful when ranking quality matters, especially with class imbalance.
- Likelihood ratio tests: useful for comparing nested logistic regression models.
For nested models, the likelihood ratio statistic is

$$\Lambda = 2\left[\ell(\hat\theta_{\text{large}}) - \ell(\hat\theta_{\text{small}})\right],$$

where $\hat\theta_{\text{small}}$ and $\hat\theta_{\text{large}}$ are the MLEs under the smaller and larger models respectively. Under standard regularity conditions, this statistic is approximately Chi-squared under the null, with degrees of freedom equal to the difference in the number of parameters.
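A minimal sketch of the test given the two maximized log-likelihoods; the numbers and parameter counts are hypothetical.

```python
from scipy.stats import chi2

# Hypothetical maximized log-likelihoods and parameter counts for nested models.
loglik_small, k_small = -412.7, 3
loglik_large, k_large = -405.2, 5

lr_stat = 2 * (loglik_large - loglik_small)
df = k_large - k_small
p_value = chi2.sf(lr_stat, df)          # survival function of the chi-squared distribution
print(f"LR statistic = {lr_stat:.2f}, df = {df}, p-value = {p_value:.4f}")
```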
Rule of Thumb
If you want well-calibrated probabilities, compare models using log loss or related likelihood-based criteria. If you care mostly about classification quality, use cross-validation with metrics that match the application.
Warning
A low training error does not guarantee good probabilities or good generalization. In logistic regression, it is common to separately think about discrimination, calibration, and out-of-sample performance.
Model Assumptions
The assumptions below are the main ones that come up in interviews and in practice.
| Assumption | Meaning | How to check | If violated |
|---|---|---|---|
| Binary response | The outcome takes values in $\{0, 1\}$ for standard binary logistic regression. | Check how the response is encoded. | The model is not appropriate as stated; use multinomial, ordinal, or another model if needed. |
| Independent observations | Conditional on the predictors, the observations are independent. | Think about the sampling process; repeated measures or clustered data are warning signs. | Standard errors and tests can be wrong. Consider clustered or mixed-effects methods. |
| Correct mean specification | The log-odds are linear in the predictors: $\log \frac{P(Y = 1 \mid X = x)}{1 - P(Y = 1 \mid X = x)} = \beta_0 + x^\top \beta$. | Residual plots, calibration plots, domain knowledge, and checking transformations/interactions. | Predicted probabilities can be biased and coefficient interpretations can be misleading. |
| No perfect multicollinearity | No predictor is an exact linear combination of the others. | Check correlations, rank deficiency, VIF, or condition number. | Coefficients may be unidentifiable or numerically unstable. |
| No complete separation | The predictors do not perfectly separate the two classes. | Watch for fitted probabilities collapsing to 0 or 1 and optimization warnings. | The MLE may not exist or coefficients may diverge to infinity. Regularization or a different model may be needed. |
| Large enough sample for asymptotic inference | Wald tests and asymptotic Normal approximations rely on enough information in the data. | Check sample size, especially the number of events in the rare class. | Standard large-sample inference can be unstable or misleading. |
Some additional practical diagnostics are worth remembering:
- Calibration asks whether predicted probabilities match observed frequencies (see the sketch after this list).
- Discrimination asks whether the model ranks positive cases above negative ones.
- Influential observations can still have a large effect on the fitted coefficients and predicted probabilities.
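A rough calibration-check sketch under simple assumptions: bin observations by predicted probability and compare the average prediction with the observed event rate in each bin. Here the predictions and labels are simulated so that the model is well calibrated by construction.

```python
import numpy as np

rng = np.random.default_rng(3)
p_hat = rng.uniform(0.05, 0.95, size=5000)      # simulated predicted probabilities
y = (rng.random(5000) < p_hat).astype(int)      # labels drawn from those probabilities

bins = np.linspace(0, 1, 11)                    # ten equal-width probability bins
which = np.digitize(p_hat, bins) - 1
for b in range(10):
    mask = which == b
    if mask.any():
        print(f"bin {bins[b]:.1f}-{bins[b+1]:.1f}: "
              f"mean predicted = {p_hat[mask].mean():.3f}, observed rate = {y[mask].mean():.3f}")
# For a well-calibrated model the two columns track each other closely in every bin.
```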
Interview Summary
If the interviewer asks for the assumptions of logistic regression, the safest compact answer is: binary outcome, independent observations, linearity on the log-odds scale, no perfect multicollinearity, no complete separation, and enough data for stable likelihood-based inference.
Warning
A common mistake is to say that logistic regression assumes the response itself is linearly related to the predictors. It does not. The linearity assumption is on the log-odds, not directly on the probability.