Naive Bayes Classifier

The naive Bayes classifier is a simple probabilistic classifier based on Bayes’ theorem. It is especially useful when we want a fast baseline for classification, when the feature dimension is large, or when the data are naturally modeled with simple class-conditional distributions.

Suppose the class label is $Y \in \{1, \dots, K\}$ and the feature vector is $X = (X_1, \dots, X_p)$. The classifier models the posterior probability of a class by

$$P(Y = k \mid X = x) = \frac{P(X = x \mid Y = k)\, P(Y = k)}{P(X = x)}.$$

The key simplifying assumption is that the features are conditionally independent given the class. Under that assumption,

$$P(X = x \mid Y = k) = \prod_{j=1}^{p} P(X_j = x_j \mid Y = k).$$

Thus,

$$P(Y = k \mid X = x) \propto P(Y = k) \prod_{j=1}^{p} P(X_j = x_j \mid Y = k).$$

Definition

The naive Bayes classifier predicts the class with the largest posterior probability:

$$\hat{y} = \arg\max_{k} \; P(Y = k) \prod_{j=1}^{p} P(X_j = x_j \mid Y = k).$$

Interpretation

Naive Bayes is a generative classifier. It models how features are generated within each class, then uses Bayes’ theorem to turn those class-conditional models into posterior class probabilities.

Conditional Independence

The adjective “naive” refers to the assumption that once we know the class label, the features no longer provide additional information about one another.

Formally, the model assumes that for each class $k$,

$$X_1, \dots, X_p \ \text{are mutually independent given} \ Y = k.$$

This means that if we know the class, the joint density or mass function factors into a product of one-dimensional terms:

$$f(x_1, \dots, x_p \mid Y = k) = \prod_{j=1}^{p} f_j(x_j \mid Y = k).$$

Intuition

Without the conditional independence assumption, modeling the joint class-conditional distribution $P(X = x \mid Y = k)$ directly can be very difficult in high dimensions. Naive Bayes replaces one hard high-dimensional modeling problem with many easier one-dimensional modeling problems.

This assumption is usually false in real data, but the classifier can still work surprisingly well in practice.

Warning

Conditional independence is not the same as ordinary (marginal) independence, and neither implies the other in general. The model does not assume the features are marginally independent; it assumes they are independent after conditioning on the class.

Properties

Important properties:

  1. Naive Bayes is usually very fast to train and evaluate.
  2. It works naturally for multiclass classification.
  3. It often performs well in high-dimensional problems such as text classification.
  4. It can output posterior class probabilities, although these probabilities are often not well calibrated.
  5. The decision rule depends on a product of class-conditional terms, so in practice it is often computed on the log scale for numerical stability.

Using logarithms, the prediction rule becomes

$$\hat{y} = \arg\max_{k} \left[ \log P(Y = k) + \sum_{j=1}^{p} \log P(X_j = x_j \mid Y = k) \right].$$
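
As a concrete illustration of the log-scale rule, here is a minimal Python/NumPy sketch; the prior vector and per-feature log-likelihood callables are hypothetical placeholders rather than part of any particular library.

```python
import numpy as np

def naive_bayes_predict(x, log_priors, log_likelihood_fns):
    """Predict a class label for one observation x.

    log_priors:         array of shape (K,) with log P(Y = k).
    log_likelihood_fns: list of K callables; the k-th callable maps the
                        feature vector x to an array of per-feature
                        log P(X_j = x_j | Y = k) values.
    """
    scores = []
    for k, log_prior in enumerate(log_priors):
        # Log prior plus the sum of per-feature log-likelihoods for class k.
        scores.append(log_prior + np.sum(log_likelihood_fns[k](x)))
    return int(np.argmax(scores))
```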

Intuition

The model gives each class a score made of two parts: a prior preference for the class and evidence from each feature. Each feature contributes independently to the total score under the model assumption.

Warning

Since the model multiplies many probabilities together, it can assign extreme posterior probabilities even when the conditional independence assumption is not very accurate.

Worked Example

Example

Suppose we have binary classification with classes $Y \in \{0, 1\}$ and two binary features $X_1, X_2 \in \{0, 1\}$. Assume we have estimated class priors $\hat{P}(Y = 1)$ and $\hat{P}(Y = 0)$, together with class-conditional probabilities $\hat{P}(X_1 = x_1 \mid Y = k)$ and $\hat{P}(X_2 = x_2 \mid Y = k)$ for each class $k$.

For a new observation with observed values $x_1$ and $x_2$, naive Bayes assigns the class-1 score

$$\hat{P}(Y = 1)\, \hat{P}(X_1 = x_1 \mid Y = 1)\, \hat{P}(X_2 = x_2 \mid Y = 1)$$

and the class-0 score

$$\hat{P}(Y = 0)\, \hat{P}(X_1 = x_1 \mid Y = 0)\, \hat{P}(X_2 = x_2 \mid Y = 0).$$

If the class-1 score is larger, the classifier predicts class 1.

This example shows the basic pattern: each class gets a prior term and then a multiplicative contribution from each feature. The predicted class is the one with the larger resulting score.
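
The sketch below reproduces this pattern in Python. The priors and conditional probabilities are made-up illustrative values, not numbers from the text.

```python
# Illustrative (made-up) estimates for a two-feature binary problem.
priors = {0: 0.5, 1: 0.5}                      # P(Y = k)
cond = {                                       # P(X_j = 1 | Y = k)
    0: {"x1": 0.2, "x2": 0.3},
    1: {"x1": 0.7, "x2": 0.6},
}

def feature_prob(p_one, value):
    """Bernoulli probability of an observed binary value."""
    return p_one if value == 1 else 1.0 - p_one

x1, x2 = 1, 0                                  # new observation

scores = {}
for k in (0, 1):
    scores[k] = (priors[k]
                 * feature_prob(cond[k]["x1"], x1)
                 * feature_prob(cond[k]["x2"], x2))

predicted = max(scores, key=scores.get)
print(scores, "->", predicted)   # with these numbers, class 1 wins
```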

Common Variants

The main difference between common naive Bayes models is the form assumed for the class-conditional feature distributions.

  1. Gaussian naive Bayes: for continuous features, assume $X_j \mid Y = k \sim N(\mu_{jk}, \sigma_{jk}^2)$, so each feature follows a class-specific normal distribution.
  2. Bernoulli naive Bayes: for binary features, model whether each feature is present or absent.
  3. Multinomial naive Bayes: often used for count data such as bag-of-words text features.

The classifier formula is the same in all cases; only the choice of the feature model changes.
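
If scikit-learn is available, the three variants share the same fit/predict interface, so switching between them is mostly a matter of matching the estimator to the feature type. A minimal sketch, with a made-up count matrix purely for illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB

# Tiny made-up count matrix (e.g., word counts) and labels.
X_counts = np.array([[2, 0, 1], [0, 3, 0], [1, 1, 0], [0, 2, 1]])
y = np.array([1, 0, 1, 0])

clf = MultinomialNB(alpha=1.0)   # alpha is the additive smoothing parameter
clf.fit(X_counts, y)
print(clf.predict(np.array([[1, 0, 2]])))

# GaussianNB() and BernoulliNB() expose the same fit / predict / predict_proba
# interface for continuous and binary features, respectively.
```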

Estimation Methods

The parameters are usually estimated by frequency counts or by simple maximum likelihood calculations within each class.

Class Priors

The class prior probabilities are typically estimated by

$$\hat{P}(Y = k) = \frac{n_k}{n},$$

where $n_k$ is the number of training examples in class $k$ and $n$ is the total number of training examples.

Class-Conditional Feature Models

The class-conditional parameters are estimated separately for each class.

For example, in Gaussian naive Bayes,

$$\hat{\mu}_{jk} = \frac{1}{n_k} \sum_{i:\, y_i = k} x_{ij}$$

and

$$\hat{\sigma}_{jk}^2 = \frac{1}{n_k} \sum_{i:\, y_i = k} (x_{ij} - \hat{\mu}_{jk})^2.$$
In Bernoulli or multinomial naive Bayes, the corresponding probabilities are estimated from class-specific counts.
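
A minimal NumPy sketch of these per-class estimates, assuming a feature matrix X and label vector y (the function and variable names are illustrative):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """Estimate class priors, per-class means, and per-class variances."""
    classes = np.unique(y)
    priors, means, variances = {}, {}, {}
    for k in classes:
        X_k = X[y == k]                          # rows belonging to class k
        priors[k] = X_k.shape[0] / X.shape[0]    # n_k / n
        means[k] = X_k.mean(axis=0)              # mu_hat_{jk} for each feature j
        variances[k] = X_k.var(axis=0)           # sigma_hat^2_{jk} (MLE, divides by n_k)
    return priors, means, variances
```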

Smoothing

For discrete naive Bayes models, it is common to use Laplace smoothing or additive smoothing to avoid zero probabilities.

Warning

Without smoothing, if a feature value never appears in the training data for a class, the estimated class-conditional probability can be zero. Since naive Bayes multiplies probabilities across features, one zero term can force the entire posterior score for that class to zero.

Interpretation

Smoothing prevents the model from becoming overconfident just because some feature/class combination was not observed in a finite sample.
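
A short sketch of additive (Laplace) smoothing for one discrete feature, assuming the raw per-class counts are already available (the names below are illustrative):

```python
import numpy as np

def smoothed_probs(counts, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * n_values)."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

# A feature value that never occurred in this class keeps a small
# nonzero probability instead of zeroing out the whole class score.
print(smoothed_probs([5, 0, 3]))        # no zero entries
print(smoothed_probs([5, 0, 3], 0.0))   # unsmoothed: contains an exact zero
```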

Text Classification Example

For text classification, a common choice is multinomial naive Bayes.

Suppose:

  1. Each document has a class label, such as spam or not spam.
  2. The features are word counts from a vocabulary of size $V$.
  3. $N_{kw}$ is the total number of times word $w$ appears in training documents from class $k$.

Then the class prior is estimated by simple class frequencies:

$$\hat{P}(Y = k) = \frac{n_k}{n},$$

where $n_k$ is the number of training documents in class $k$ and $n$ is the total number of training documents.

The class-conditional word probabilities are estimated by normalized counts:

$$\hat{P}(w \mid Y = k) = \frac{N_{kw}}{\sum_{w'} N_{kw'}}.$$

With Laplace smoothing,

$$\hat{P}(w \mid Y = k) = \frac{N_{kw} + \alpha}{\sum_{w'} N_{kw'} + \alpha V},$$

where often $\alpha = 1$.

Example

Suppose 40 out of 100 training emails are spam. Then

$$\hat{P}(\text{spam}) = \frac{40}{100} = 0.4 \quad \text{and} \quad \hat{P}(\text{not spam}) = \frac{60}{100} = 0.6.$$

If the word free appears 120 times in spam emails but only 10 times in non-spam emails, then the estimated probability of the word free given spam will be much larger than the estimated probability of free given non-spam. This makes free evidence in favor of the spam class.

At prediction time, the model combines the prior and the word evidence for each class. In log form, the class score is

$$\log \hat{P}(Y = k) + \sum_{w} c_w \log \hat{P}(w \mid Y = k),$$

where $c_w$ is the count of word $w$ in the new document.
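
Putting the pieces together, here is a minimal sketch of multinomial scoring for a new document. The priors and the counts for the word "free" match the example above; the remaining vocabulary and counts are made up for illustration.

```python
import numpy as np

vocab = ["free", "meeting", "offer"]

# Per-class word counts N_kw over the training documents
# ("free": 120 in spam, 10 in non-spam; other counts are made up).
word_counts = {
    "spam":     np.array([120.0, 5.0, 80.0]),
    "not spam": np.array([10.0, 90.0, 15.0]),
}
log_priors = {"spam": np.log(0.4), "not spam": np.log(0.6)}

alpha = 1.0  # Laplace smoothing
log_word_probs = {
    k: np.log((c + alpha) / (c.sum() + alpha * len(vocab)))
    for k, c in word_counts.items()
}

# Word counts c_w for the new document, aligned with vocab.
doc_counts = np.array([2.0, 0.0, 1.0])

scores = {
    k: log_priors[k] + np.dot(doc_counts, log_word_probs[k])
    for k in word_counts
}
print(max(scores, key=scores.get), scores)
```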

Model Selection

Compared with linear or logistic regression, model selection for naive Bayes is usually less about adding or removing many coefficients and more about choosing the feature representation and the class-conditional distribution.

Common choices include:

  1. Choosing the variant: Gaussian, Bernoulli, or multinomial naive Bayes depending on the feature type.
  2. Cross-validation: useful for comparing different feature representations, smoothing levels, or preprocessing choices.
  3. Log loss: useful if you care about the quality of predicted probabilities.
  4. ROC-AUC / PR-AUC: useful for binary classification when ranking quality matters.
  5. Accuracy / error rate: often used as a simple baseline comparison metric.

Rule of Thumb

In naive Bayes, good performance often depends more on the feature encoding and the choice of class-conditional model than on complicated optimization or inference machinery.

Naive Bayes vs Logistic Regression

Naive Bayes and logistic regression are both standard probabilistic classifiers, but they make different modeling choices.

Naive Bayes is a generative model:

  1. It models $P(Y = k)$ and $P(X = x \mid Y = k)$.
  2. It uses Bayes’ theorem to compute $P(Y = k \mid X = x)$.
  3. It relies on the conditional independence assumption to make the feature model tractable.

Logistic regression is a discriminative model:

  1. It models $P(Y = k \mid X = x)$ directly.
  2. It does not try to model the full feature distribution.
  3. It assumes linearity on the log-odds scale rather than conditional independence of features.

Reasons to prefer naive Bayes:

  1. You want a very fast, simple baseline.
  2. The feature dimension is large, and you want to avoid modeling a full joint class-conditional distribution over the features.
  3. You have relatively limited data and want a model with simple class-conditional estimates.
  4. The chosen feature model is a good fit for the data.

Reasons to prefer logistic regression:

  1. You care about modeling the decision boundary directly.
  2. The predictors are strongly correlated, so the naive conditional independence assumption is not very plausible.
  3. You want coefficients with a direct log-odds interpretation.
  4. You want predicted probabilities that are often better calibrated.
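
One practical way to adjudicate between the two on a given dataset is a simple cross-validated comparison. The sketch below assumes scikit-learn and uses synthetic data purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

for name, model in [("naive Bayes", GaussianNB()),
                    ("logistic regression", LogisticRegression(max_iter=1000))]:
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean CV accuracy = {scores.mean():.3f}")
```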

Rule of Thumb

Naive Bayes often works well as a fast high-dimensional baseline because the conditional independence assumption avoids the much harder problem of modeling a full joint feature distribution within each class. Logistic regression is often preferred when interpretability, calibration, and fewer independence assumptions are more important.

Warning

Neither model dominates the other in every setting. Naive Bayes can outperform logistic regression when its assumptions are useful approximations and the sample size is limited, while logistic regression often improves as more labeled data become available.

Model Assumptions

The assumptions below are the main ones that come up in interviews and in practice.

| Assumption | Meaning | How to check | If violated |
| --- | --- | --- | --- |
| Correct feature model | The chosen class-conditional distribution is a reasonable fit for the feature type, such as Gaussian for continuous features or multinomial for counts. | Compare the data type and empirical behavior to the model choice. | Predictions can degrade because the class-conditional probabilities are poorly estimated. |
| Conditional independence | Given the class, the features are independent. | Think about whether features remain strongly related even after conditioning on the label. | Posterior probabilities can be distorted, though classification can still work well. |
| Representative class priors | The estimated class frequencies reflect the real deployment setting, unless priors are set manually. | Compare training class balance to the target population. | The model may systematically overpredict or underpredict some classes. |
| Sufficient data per class | Each class has enough observations to estimate its feature distributions reliably. | Check class counts, rare categories, and sparsity. | Parameter estimates become noisy and unstable, especially for minority classes. |

Some additional practical diagnostics are worth remembering:

  1. If the predicted probabilities are important, check calibration, since naive Bayes probabilities are often too extreme (see the sketch after this list).
  2. If many features are redundant copies of one another, the model may effectively count the same evidence multiple times.
  3. For text and count data, smoothing is often essential.
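
For the calibration point in item 1 above, a quick check is to compare predicted probabilities with empirical frequencies on held-out data. A sketch assuming scikit-learn, binary labels, and synthetic data purely for illustration:

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic data purely for illustration.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

probs = GaussianNB().fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

# Empirical fraction of positives vs. mean predicted probability per bin;
# large gaps between the two columns indicate poor calibration.
frac_pos, mean_pred = calibration_curve(y_te, probs, n_bins=10)
for f, m in zip(frac_pos, mean_pred):
    print(f"predicted {m:.2f}  observed {f:.2f}")
```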

Interview Summary

If the interviewer asks for the assumptions of naive Bayes, the safest compact answer is: choose an appropriate class-conditional distribution, assume the features are conditionally independent given the class, estimate reasonable class priors, and have enough data in each class to estimate the required probabilities.

Warning

A common mistake is to say that naive Bayes assumes the features are independent in general. The actual assumption is different and more specific: the features are independent conditional on the class label.