Decision Trees

Decision trees are supervised learning methods for both regression and classification. They work by recursively splitting the predictor space into simple regions, then making one prediction per region.

Their main appeal is interpretability: the model can be read as a sequence of simple if-then rules. Their main weakness is instability: small changes in the data can lead to a very different tree.

Prediction

The idea is simple:

  1. Partition the predictor space into $J$ disjoint regions $R_1, R_2, \dots, R_J$.
  2. For a new observation, find which region it falls into.
  3. Use the same prediction for every point in that region.

For a regression tree, that prediction is usually the mean response in the region. For a classification tree, it is usually the most common class in the region.
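As a minimal sketch of step 2 and 3 (the region boundaries and means here are hypothetical, standing in for a fitted tree on one predictor), prediction is just a region lookup:

```python
import numpy as np

# Hypothetical fitted partition of a single predictor x into three regions,
# each labeled with the mean training response in that region.
boundaries = [2.0, 5.0]            # region edges: (-inf, 2), [2, 5), [5, inf)
region_means = [1.5, 3.0, 7.2]     # mean response per region

def predict(x):
    # Find which region x falls into, then return that region's mean.
    region = np.searchsorted(boundaries, x, side="right")
    return region_means[region]

low = predict(1.0)   # falls in (-inf, 2)
high = predict(6.0)  # falls in [5, inf)
```

Every observation in the same region receives the same prediction, no matter where in the region it lies.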

Intuition

A decision tree turns a complicated prediction problem into a sequence of simple questions such as “is $X_j < t$?”. Each split narrows down which part of the feature space the observation belongs to.

Building a Tree

Trees are usually built by recursive binary splitting. Starting with all observations in one region, we repeatedly choose:

  1. A predictor $X_j$
  2. A cutpoint $s$

This creates two child regions:

$$R_1(j, s) = \{X \mid X_j < s\} \qquad R_2(j, s) = \{X \mid X_j \ge s\}$$

For regression, we choose the pair $(j, s)$ that gives the largest reduction in RSS:

$$\min_{j,\, s} \left[ \sum_{i:\, x_i \in R_1(j,s)} (y_i - \bar{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \bar{y}_{R_2})^2 \right]$$

Here $\bar{y}_{R_1}$ and $\bar{y}_{R_2}$ are the mean responses in the two child regions.

In Practice

In practice, the algorithm tries many candidate splits and picks the best one greedily. To find $j$ and $s$, we loop over predictors and test candidate thresholds, usually between consecutive sorted observed feature values. For each candidate split, we compute the impurity decrease and keep the best one.

For a continuous predictor, there is usually no closed-form best split. Instead, after sorting the distinct observed values $x_{(1)} < x_{(2)} < \dots$ of $X_j$, we only need to test thresholds between consecutive values, often using midpoints such as

$$s = \frac{x_{(i)} + x_{(i+1)}}{2}$$

This is enough because moving within an interval that contains no data points does not change which observations go left or right.
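The midpoint search can be exercised on a toy dataset where the best split is obvious (the data below are made up so that the response jumps at $x = 0.5$):

```python
import numpy as np

# Toy data: y jumps from ~0 to ~10 at x = 0.5, so the best single split
# should land exactly between the two groups.
X = np.array([[0.1], [0.2], [0.3], [0.7], [0.8], [0.9]])
y = np.array([0.0, 0.1, -0.1, 10.0, 10.1, 9.9])

best = (None, np.inf)
values = np.sort(np.unique(X[:, 0]))
# Candidate thresholds: midpoints between consecutive distinct values.
for t in (values[:-1] + values[1:]) / 2.0:
    left, right = y[X[:, 0] < t], y[X[:, 0] >= t]
    rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
    if rss < best[1]:
        best = (t, rss)

threshold, rss = best
# The chosen threshold cleanly separates the low group from the high group.
```

Only five midpoints need to be tested here, even though any threshold in $(0.3, 0.7)$ would produce the same partition.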

Warning

Tree-growing is greedy: it chooses the best split right now, not the globally best tree. This is one reason trees can be suboptimal even when they are easy to interpret.

Implementation

The code below implements a very small regression tree for continuous features. It is not optimized, but it shows the core mechanics:

  1. Search over all features and candidate thresholds
  2. Pick the split with the lowest RSS
  3. Recurse on the left and right children
  4. Predict with the mean response at each leaf
from dataclasses import dataclass
from typing import Optional

import numpy as np


@dataclass
class Node:
    feature: Optional[int] = None
    threshold: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: Optional[float] = None
 
 
class DecisionTreeRegressor:
    def __init__(self, max_depth=3, min_samples_split=2):
        self.max_depth = max_depth
        self.min_samples_split = min_samples_split
        self.root = None
 
    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        self.root = self._build_tree(X, y, depth=0)
 
    def predict(self, X):
        X = np.asarray(X, dtype=float)
        return np.array([self._predict_one(x, self.root) for x in X])
 
    def _build_tree(self, X, y, depth):
        # Stop splitting when the tree is deep enough or the node is too small
        if depth >= self.max_depth or len(y) < self.min_samples_split:
            return Node(value=float(np.mean(y)))
 
        split = self._best_split(X, y)
        # If no valid split exists, make this node a leaf
        if split is None:
            return Node(value=float(np.mean(y)))
 
        feature, threshold = split
        # Send each example to the left or right child according to the split rule
        left_mask = X[:, feature] < threshold
        right_mask = ~left_mask
 
        return Node(
            feature=feature,
            threshold=threshold,
            left=self._build_tree(X[left_mask], y[left_mask], depth + 1),
            right=self._build_tree(X[right_mask], y[right_mask], depth + 1),
        )
 
    def _best_split(self, X, y):
        n, p = X.shape
        best_feature = None
        best_threshold = None
        best_rss = np.inf
 
        for feature in range(p):
            values = np.sort(np.unique(X[:, feature]))
            if len(values) < 2:
                continue
 
            # Candidate thresholds are midpoints between consecutive values
            thresholds = (values[:-1] + values[1:]) / 2.0
            for threshold in thresholds:
                left_mask = X[:, feature] < threshold
                right_mask = ~left_mask
 
                if left_mask.sum() == 0 or right_mask.sum() == 0:
                    continue
 
                left_y = y[left_mask]
                right_y = y[right_mask]
                # Score this split by the total within-node RSS after splitting
                rss = np.sum((left_y - left_y.mean()) ** 2) + np.sum(
                    (right_y - right_y.mean()) ** 2
                )
 
                if rss < best_rss:
                    best_rss = rss
                    best_feature = feature
                    best_threshold = threshold
 
        if best_feature is None:
            return None
 
        return best_feature, best_threshold
 
    def _predict_one(self, x, node):
        # Recursively split left/right until terminal node reached
        if node.value is not None:
            return node.value
        # Follow the same split rules used during training until we reach a leaf
        if x[node.feature] < node.threshold:
            return self._predict_one(x, node.left)
        return self._predict_one(x, node.right)

Regression Trees

A regression tree predicts a numeric response.

Definition

For regression trees, the usual split criterion is the residual sum of squares:

$$\text{RSS} = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \bar{y}_{R_j})^2$$

where $\bar{y}_{R_j}$ is the mean response of the training observations in region $R_j$.

The tree keeps splitting regions so that the responses inside each terminal node become as similar as possible.

Classification Trees

A classification tree predicts a qualitative response. In each terminal node, the prediction is the majority class, though the estimated class proportions are often useful too.

One possible impurity measure is the classification error:

$$E_m = 1 - \max_k \hat{p}_{mk}$$

where $\hat{p}_{mk}$ is the proportion of class $k$ in region $m$.

However, classification error is usually not sensitive enough for growing the tree. In practice, the two most common impurity measures are the Gini index and entropy.

Definition (Gini Index)

The Gini index is

$$G_m = \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})$$

It is small when one class dominates the node.

Definition (Entropy)

Entropy is

$$D_m = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}$$

Like the Gini index, it is small when the node is pure and larger when the class proportions are mixed.

Intuition

Gini and entropy both measure how mixed a node is. A pure node has mostly one class and low impurity. A mixed node has several classes present and high impurity.

In practice, Gini and entropy often lead to very similar trees.
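A small illustrative computation (plain NumPy, class proportions chosen arbitrarily) compares all three measures on a pure node versus a maximally mixed node:

```python
import numpy as np

def class_error(p):
    # Classification error: 1 minus the largest class proportion.
    return 1.0 - np.max(p)

def gini(p):
    # Gini index: sum of p_k * (1 - p_k) over classes.
    return np.sum(p * (1.0 - p))

def entropy(p):
    # Entropy: -sum of p_k * log(p_k); zero-probability classes contribute 0.
    p = p[p > 0]
    return -np.sum(p * np.log(p))

pure = np.array([1.0, 0.0])    # one class dominates: impurity is 0
mixed = np.array([0.5, 0.5])   # maximally mixed for two classes

# All three measures are 0 for the pure node and largest for the mixed node.
```

For two classes, the mixed node gives a Gini index of 0.5 and an entropy of $\log 2$, while every measure is exactly zero at the pure node.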

Tree Pruning

If we keep splitting until the tree fits the training data very closely, the tree will usually overfit. A deep tree has low bias but high variance.

The standard fix is:

  1. Grow a large tree $T_0$.
  2. Prune it back to a smaller subtree.
  3. Use validation or cross-validation to choose the subtree size.

Definition

Cost complexity pruning chooses a subtree $T \subseteq T_0$ to minimize

$$C_\alpha(T) = \sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \bar{y}_{R_m})^2 + \alpha |T|$$

where $|T|$ is the number of terminal nodes and $\alpha \ge 0$ penalizes tree size.

When $\alpha = 0$, there is no penalty and we keep the large tree $T_0$. As $\alpha$ increases, smaller trees are preferred.

Intuition

Pruning says: do not keep a split unless it improves fit enough to justify the extra complexity.
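A tiny numeric sketch makes the trade-off concrete (the RSS values and leaf counts below are hypothetical, standing in for candidate subtrees of one grown tree):

```python
# Hypothetical candidate subtrees: (training RSS, number of terminal nodes).
# Bigger subtrees fit the training data better but pay a larger penalty.
subtrees = [(40.0, 8), (45.0, 4), (60.0, 1)]

def cost(rss, n_leaves, alpha):
    # Cost complexity criterion: RSS + alpha * |T|.
    return rss + alpha * n_leaves

def best_subtree(alpha):
    return min(subtrees, key=lambda t: cost(t[0], t[1], alpha))

# alpha = 0 keeps the full tree; a large alpha prefers the single-leaf stump.
```

As $\alpha$ sweeps from 0 upward, the minimizer moves from the largest subtree toward the stump, which is exactly the sequence cross-validation chooses among.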

Strengths and Weaknesses

Strengths:

  • Very easy to explain and visualize
  • Naturally capture nonlinear effects and interactions
  • Can handle qualitative predictors without manually creating dummy variables

Weaknesses:

  • High variance: small data changes can produce a very different tree
  • Often worse predictive accuracy than stronger supervised learning methods
  • Greedy splitting does not guarantee the globally best tree

Bagging

Trees are high-variance models, so averaging many trees can help a lot. This is the idea behind bagging.

Using $B$ bootstrap samples of the training data, we fit trees

$$\hat{f}^{*1}(x),\ \hat{f}^{*2}(x),\ \dots,\ \hat{f}^{*B}(x)$$

and average them:

$$\hat{f}_{\text{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x)$$

For classification, we usually use majority vote instead of averaging numeric predictions.

Intuition

A single tree is unstable, but the average of many unstable trees can be much more stable.
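The stabilizing effect can be sketched without trees at all: below, each "model" is just the mean of one bootstrap sample (standing in for a tree fit on that sample), and the bagged prediction averages them (toy data, fixed seed):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=2.0, size=200)  # toy training responses

B = 100
# Fit one "model" per bootstrap sample: here, just the sample mean.
boot_preds = np.array([
    rng.choice(y, size=len(y), replace=True).mean() for _ in range(B)
])

bagged_pred = boot_preds.mean()   # average the B bootstrap models
spread_single = boot_preds.std()  # variability of a single bootstrap model

# Individual bootstrap predictions scatter around the full-sample mean;
# their average lands much closer to it than a typical single model.
```

The same logic applies to trees: any one bootstrap tree is noisy, but the average over $B$ of them is far more stable.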

Random Forests

Random forests improve bagging by making the trees less correlated.

As in bagging, each tree is fit on a bootstrap sample. The extra idea is that at each split, the algorithm only considers a random subset of $m$ of the $p$ predictors rather than all of them (a common default is $m \approx \sqrt{p}$).

If there is one very strong predictor, ordinary bagging may use it in almost every tree, making the trees similar. Randomly restricting the candidate predictors forces different trees to explore different splits, which reduces correlation and usually improves the ensemble.

If we choose $m = p$, so that all predictors are candidates at every split, we recover ordinary bagging.

Rule of Thumb

Bagging reduces variance by averaging many trees. Random forests usually do better because they reduce both variance and tree-to-tree correlation.
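The only change relative to bagging happens inside the split search: before scoring candidates, restrict the loop to a random subset of features. A minimal sketch of that selection (assuming the common $m \approx \sqrt{p}$ default; the function name is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

def candidate_features(p):
    # At each split, consider only m of the p predictors, m ~ sqrt(p),
    # drawn without replacement so no feature is considered twice.
    m = max(1, int(np.sqrt(p)))
    return rng.choice(p, size=m, replace=False)

feats = candidate_features(16)  # 4 of the 16 predictors
```

In the `_best_split` method above, the outer loop would iterate over `candidate_features(p)` instead of `range(p)`, with a fresh draw at every node.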

Boosting

Boosting also builds many trees, but in a very different way. Instead of fitting many large independent trees, it fits trees sequentially, where each new tree tries to improve on the mistakes made so far.

Algorithm (Boosting for Regression Trees)

  1. Set the current model to $\hat{f}(x) = 0$ and initialize the residuals as $r_i = y_i$ for all $i$.
  2. For $b = 1, 2, \dots, B$:
    • Fit a small tree $\hat{f}^b$ to the current residuals $r$.
    • Update the model: $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)$
    • Update the residuals: $r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)$
  3. Output the boosted model $\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)$.

The main tuning parameters are:

  1. The number of trees $B$
  2. The learning rate or shrinkage parameter $\lambda$
  3. The depth $d$ of each tree

Often the individual trees are small, sometimes even stumps with just one split.
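With stumps as the weak learners, the algorithm above fits in a few lines (toy data; the stump fit reuses the midpoint search described earlier, and $B$ and $\lambda$ are arbitrary choices):

```python
import numpy as np

def fit_stump(X, y):
    """Best single split on one predictor: (threshold, left_mean, right_mean)."""
    best = None
    values = np.sort(np.unique(X))
    for t in (values[:-1] + values[1:]) / 2.0:
        left, right = y[X < t], y[X >= t]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or rss < best[0]:
            best = (rss, t, left.mean(), right.mean())
    return best[1:]

def boost(X, y, B=200, lam=0.1):
    pred = np.zeros_like(y)
    stumps = []
    for _ in range(B):
        r = y - pred                   # current residuals
        t, lo, hi = fit_stump(X, r)    # fit a stump to the residuals
        pred += lam * np.where(X < t, lo, hi)  # shrunken update
        stumps.append((t, lo, hi))
    return pred, stumps

X = np.linspace(0.0, 1.0, 50)
y = np.sin(2 * np.pi * X)              # smooth toy target
pred, stumps = boost(X, y)
# Training error shrinks as more shrunken stumps are added.
```

Each stump only corrects what the current ensemble still gets wrong, and the shrinkage factor $\lambda$ keeps any single stump from dominating.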

Intuition

Boosting builds a strong model by repeatedly adding weak trees that correct what the current model still gets wrong.

Warning

Unlike bagging and random forests, boosting can overfit if pushed too far, though in practice it is often fairly resistant when the learning rate is small.