Decision Trees
Decision trees are supervised learning methods for both regression and classification. They work by recursively splitting the predictor space into simple regions, then making one prediction per region.
Their main appeal is interpretability: the model can be read as a sequence of simple if-then rules. Their main weakness is instability: small changes in the data can lead to a very different tree.
Prediction
The idea is simple:
- Partition the predictor space into disjoint regions $R_1, R_2, \ldots, R_J$.
- For a new observation, find which region it falls into.
- Use the same prediction for every point in that region.
For a regression tree, that prediction is usually the mean response in the region. For a classification tree, it is usually the most common class in the region.
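As a concrete (made-up) illustration, here is a minimal Python sketch of the region-based prediction rule, with a single feature split into two regions at an arbitrary cutpoint of 3.0:
import numpy as np

# Toy 1D data; the cutpoint 3.0 is an arbitrary example split, not a fitted one
x_train = np.array([1.0, 2.0, 2.5, 4.0, 5.0, 6.0])
y_train = np.array([1.1, 0.9, 1.0, 3.2, 2.8, 3.0])

cutpoint = 3.0
left_mean = y_train[x_train < cutpoint].mean()    # prediction for region {x < 3.0}
right_mean = y_train[x_train >= cutpoint].mean()  # prediction for region {x >= 3.0}

def predict(x):
    # Every observation falling in a region gets that region's mean response
    return left_mean if x < cutpoint else right_mean

print(predict(2.2), predict(4.7))  # one constant prediction per region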
Intuition
A decision tree turns a complicated prediction problem into a sequence of simple questions such as “is $X_j < t$?”. Each split narrows down which part of the feature space the observation belongs to.
Building a Tree
Trees are usually built by recursive binary splitting. Starting with all observations in one region, we repeatedly choose:
- A predictor $X_j$
- A cutpoint $s$
This creates two child regions:
$$R_1(j, s) = \{X \mid X_j < s\} \quad \text{and} \quad R_2(j, s) = \{X \mid X_j \ge s\}.$$
For regression, we choose the split that gives the largest reduction in RSS, i.e. the pair $(j, s)$ minimizing
$$\sum_{i:\, x_i \in R_1(j,s)} (y_i - \hat{y}_{R_1})^2 + \sum_{i:\, x_i \in R_2(j,s)} (y_i - \hat{y}_{R_2})^2.$$
Here $\hat{y}_{R_1}$ and $\hat{y}_{R_2}$ are the mean responses in the two child regions.
In Practice
In practice, the algorithm tries many candidate splits and picks the best one greedily. To find $j$ and $s$, we loop over the predictors and test candidate thresholds, usually placed between consecutive sorted observed feature values. For each candidate split, we compute the impurity decrease and keep the best one.
For a continuous predictor $X_j$, there is usually no closed-form best split. Instead, after sorting the observed values of $X_j$, we only need to test thresholds between consecutive distinct values, often using midpoints such as
$$s = \frac{x_{(k)} + x_{(k+1)}}{2},$$
where $x_{(k)}$ and $x_{(k+1)}$ are consecutive distinct sorted values. This is enough because moving the threshold within an interval that contains no data points does not change which observations go left or right.
Warning
Tree-growing is greedy: it chooses the best split right now, not the globally best tree. This is one reason trees can be suboptimal even when they are easy to interpret.
Implementation
The code below implements a very small regression tree for continuous features. It is not optimized, but it shows the core mechanics:
- Search over all features and candidate thresholds
- Pick the split with the lowest RSS
- Recurse on the left and right children
- Predict with the mean response at each leaf
from dataclasses import dataclass
import numpy as np
@dataclass
class Node:
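    # Internal nodes carry a split (feature, threshold) and two children; leaves carry only value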
feature: int = None
threshold: float = None
left: "Node" = None
right: "Node" = None
value: float = None
class DecisionTreeRegressor:
def __init__(self, max_depth=3, min_samples_split=2):
self.max_depth = max_depth
self.min_samples_split = min_samples_split
self.root = None
def fit(self, X, y):
X = np.asarray(X, dtype=float)
y = np.asarray(y, dtype=float)
self.root = self._build_tree(X, y, depth=0)
def predict(self, X):
X = np.asarray(X, dtype=float)
return np.array([self._predict_one(x, self.root) for x in X])
def _build_tree(self, X, y, depth):
# Stop splitting when the tree is deep enough or the node is too small
if depth >= self.max_depth or len(y) < self.min_samples_split:
return Node(value=float(np.mean(y)))
split = self._best_split(X, y)
# If no valid split exists, make this node a leaf
if split is None:
return Node(value=float(np.mean(y)))
feature, threshold = split
# Send each example to the left or right child according to the split rule
left_mask = X[:, feature] < threshold
right_mask = ~left_mask
return Node(
feature=feature,
threshold=threshold,
left=self._build_tree(X[left_mask], y[left_mask], depth + 1),
right=self._build_tree(X[right_mask], y[right_mask], depth + 1),
)
def _best_split(self, X, y):
n, p = X.shape
best_feature = None
best_threshold = None
best_rss = np.inf
for feature in range(p):
values = np.sort(np.unique(X[:, feature]))
if len(values) < 2:
continue
# Candidate thresholds are midpoints between consecutive values
thresholds = (values[:-1] + values[1:]) / 2.0
for threshold in thresholds:
left_mask = X[:, feature] < threshold
right_mask = ~left_mask
if left_mask.sum() == 0 or right_mask.sum() == 0:
continue
left_y = y[left_mask]
right_y = y[right_mask]
# Score this split by the total within-node RSS after splitting
                rss = (np.sum((left_y - left_y.mean()) ** 2)
                       + np.sum((right_y - right_y.mean()) ** 2))
if rss < best_rss:
best_rss = rss
best_feature = feature
best_threshold = threshold
if best_feature is None:
return None
return best_feature, best_threshold
def _predict_one(self, x, node):
# Recursively split left/right until terminal node reached
if node.value is not None:
return node.value
# Follow the same split rules used during training until we reach a leaf
if x[node.feature] < node.threshold:
return self._predict_one(x, node.left)
        return self._predict_one(x, node.right)
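A quick usage sketch of the class above; the synthetic data and hyperparameters are arbitrary and only meant to show the interface:
# Fit the toy tree to a noisy quadratic and query a few points
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = X_demo[:, 0] ** 2 + rng.normal(scale=0.3, size=200)

tree = DecisionTreeRegressor(max_depth=3, min_samples_split=5)
tree.fit(X_demo, y_demo)
print(tree.predict(np.array([[-2.0], [0.0], [2.0]])))  # piecewise-constant estimates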
Regression Trees
A regression tree predicts a numeric response.
Definition
For regression trees, the usual split criterion is the residual sum of squares:
$$\text{RSS} = \sum_{j=1}^{J} \sum_{i \in R_j} (y_i - \hat{y}_{R_j})^2,$$
where $\hat{y}_{R_j}$ is the mean response of the training observations in region $R_j$.
The tree keeps splitting regions so that the responses inside each terminal node become as similar as possible.
Classification Trees
A classification tree predicts a qualitative response. In each terminal node, the prediction is the majority class, though the estimated class proportions are often useful too.
One possible impurity measure is the classification error:
$$E_m = 1 - \max_k \hat{p}_{mk},$$
where $\hat{p}_{mk}$ is the proportion of class $k$ in region $R_m$.
However, classification error is usually not sensitive enough for growing the tree. In practice, the two most common impurity measures are the Gini index and entropy.
Definition (Gini Index)
The Gini index is
$$G_m = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk}).$$
It is small when one class dominates the node.
Definition (Entropy)
Entropy is
$$D_m = -\sum_{k=1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}.$$
Like the Gini index, it is small when the node is pure and larger when the class proportions are mixed.
Intuition
Gini and entropy both measure how mixed a node is. A pure node has mostly one class and low impurity. A mixed node has several classes present and high impurity.
In practice, Gini and entropy often lead to very similar trees.
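Both measures (plus the classification error) are easy to compute from the class proportions in a node. The helper names below are my own, not from any library:
import numpy as np

def class_proportions(labels):
    # Proportion of each class among the observations in a node
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def misclassification_error(labels):
    return 1.0 - class_proportions(labels).max()

def gini(labels):
    p = class_proportions(labels)
    return float(np.sum(p * (1.0 - p)))

def entropy(labels):
    p = class_proportions(labels)
    return float(-np.sum(p * np.log2(p)))  # base-2 logarithm, one common convention

node = np.array(["a", "a", "a", "b"])  # mostly one class -> low impurity
print(misclassification_error(node), gini(node), entropy(node))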
Tree Pruning
If we keep splitting until the tree fits the training data very closely, the tree will usually overfit. A deep tree has low bias but high variance.
The standard fix is:
- Grow a large tree $T_0$.
- Prune it back to a smaller subtree.
- Use validation or cross-validation to choose the subtree size.
Definition
For each value of the tuning parameter $\alpha \ge 0$, cost complexity pruning chooses the subtree $T \subseteq T_0$ that minimizes
$$\sum_{m=1}^{|T|} \sum_{i:\, x_i \in R_m} (y_i - \hat{y}_{R_m})^2 + \alpha |T|,$$
where $|T|$ is the number of terminal nodes and $\alpha$ penalizes tree size.
When $\alpha = 0$, there is no penalty and we keep the large tree. As $\alpha$ increases, smaller trees are preferred.
Intuition
Pruning says: do not keep a split unless it improves fit enough to justify the extra complexity.
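In practice this search is usually delegated to a library. One possible route is scikit-learn, which exposes the penalty as `ccp_alpha`; the sketch below (assuming scikit-learn is installed, and aliasing its tree class to avoid clashing with the implementation above) cross-validates over the alphas returned by `cost_complexity_pruning_path`:
import numpy as np
from sklearn.tree import DecisionTreeRegressor as SkTreeRegressor
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 2))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.3, size=300)

# Grow a large tree first, then recover the sequence of effective alpha values
path = SkTreeRegressor(random_state=0).cost_complexity_pruning_path(X, y)

scores = []
for alpha in path.ccp_alphas:
    pruned = SkTreeRegressor(ccp_alpha=alpha, random_state=0)
    scores.append(cross_val_score(pruned, X, y, cv=5).mean())

best_alpha = path.ccp_alphas[int(np.argmax(scores))]
print(best_alpha)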
Strengths and Weaknesses
Strengths:
- Very easy to explain and visualize
- Naturally capture nonlinear effects and interactions
- Can handle qualitative predictors without manually creating dummy variables
Weaknesses:
- High variance: small data changes can produce a very different tree
- Often worse predictive accuracy than stronger supervised learning methods
- Greedy splitting does not guarantee the globally best tree
Bagging
Trees are high-variance models, so averaging many trees can help a lot. This is the idea behind bagging.
Using $B$ bootstrap samples of the training set, we fit trees
$$\hat{f}^{*1}(x), \hat{f}^{*2}(x), \ldots, \hat{f}^{*B}(x)$$
and average them:
$$\hat{f}_{\mathrm{bag}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x).$$
For classification, we usually use majority vote instead of averaging numeric predictions.
Intuition
A single tree is unstable, but the average of many unstable trees can be much more stable.
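A minimal bagging sketch that reuses the `DecisionTreeRegressor` implemented earlier; the number of trees and the bootstrap loop are illustrative choices, not a tuned setup:
import numpy as np

class BaggedTrees:
    def __init__(self, n_trees=100, max_depth=5):
        self.n_trees = n_trees
        self.max_depth = max_depth
        self.trees = []

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        rng = np.random.default_rng(0)
        n = len(y)
        self.trees = []
        for _ in range(self.n_trees):
            # Bootstrap sample: n observations drawn with replacement
            idx = rng.integers(0, n, size=n)
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)
        return self

    def predict(self, X):
        # Average the predictions of all bootstrapped trees
        return np.mean([tree.predict(X) for tree in self.trees], axis=0)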
Random Forests
Random forests improve bagging by making the trees less correlated.
As in bagging, each tree is fit on a bootstrap sample. The extra idea is that at each split, the algorithm only considers a random subset of $m$ of the $p$ predictors rather than all of them.
If there is one very strong predictor, ordinary bagging may use it in almost every tree, making the trees similar. Randomly restricting the candidate predictors forces different trees to explore different splits, which reduces correlation and usually improves the ensemble.
If we allow all predictors as candidates at every split (so $m = p$), this is just bagging; common defaults are $m \approx \sqrt{p}$ for classification and $m \approx p/3$ for regression.
Rule of Thumb
Bagging reduces variance by averaging many trees. Random forests usually do better because they reduce both variance and tree-to-tree correlation.
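One possible way to graft the random-forest idea onto the earlier implementation is to subclass it and restrict which features `_best_split` may consider; `max_features` and the $\sqrt{p}$ default are assumptions of this sketch, not part of the original code:
import numpy as np

class RandomSubsetTree(DecisionTreeRegressor):
    """Regression tree that considers only a random subset of features per split."""

    def __init__(self, max_depth=5, min_samples_split=2, max_features=None, rng=None):
        super().__init__(max_depth=max_depth, min_samples_split=min_samples_split)
        self.max_features = max_features
        self.rng = rng if rng is not None else np.random.default_rng()

    def _best_split(self, X, y):
        n, p = X.shape
        m = self.max_features or max(1, int(np.sqrt(p)))  # sqrt(p) as an assumed default
        # Search only a random subset of m candidate predictors at this split
        candidates = self.rng.choice(p, size=min(m, p), replace=False)
        split = super()._best_split(X[:, candidates], y)
        if split is None:
            return None
        feature, threshold = split
        # Map the subset column index back to the original feature index
        return int(candidates[feature]), threshold
A full random forest would then fit many such trees on bootstrap samples, exactly as in the bagging sketch above, and average their predictions.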
Boosting
Boosting also builds many trees, but in a very different way. Instead of fitting many large independent trees, it fits trees sequentially, where each new tree tries to improve on the mistakes made so far.
Algorithm (Boosting for Regression Trees)
- Set the current model to $\hat{f}(x) = 0$ and initialize the residuals as $r_i = y_i$ for every training observation.
- For $b = 1, 2, \ldots, B$:
  - Fit a small tree $\hat{f}^b$ with $d$ splits to the current residuals.
  - Update the model: $\hat{f}(x) \leftarrow \hat{f}(x) + \lambda \hat{f}^b(x)$
  - Update the residuals: $r_i \leftarrow r_i - \lambda \hat{f}^b(x_i)$
- Output the boosted model $\hat{f}(x) = \sum_{b=1}^{B} \lambda \hat{f}^b(x)$.
The main tuning parameters are:
- The number of trees $B$
- The learning rate or shrinkage parameter $\lambda$
- The depth $d$ of each tree
Often the individual trees are small, sometimes even stumps with just one split.
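A minimal sketch of this loop, reusing the earlier `DecisionTreeRegressor` as the weak learner; $B$, $\lambda$, and $d$ map onto `n_trees`, `learning_rate`, and `max_depth`, and the default values are illustrative:
import numpy as np

class BoostedTrees:
    def __init__(self, n_trees=200, learning_rate=0.05, max_depth=1):
        self.n_trees = n_trees              # B: number of trees
        self.learning_rate = learning_rate  # lambda: shrinkage parameter
        self.max_depth = max_depth          # d: depth of each weak tree (1 = stumps)
        self.trees = []

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        residuals = y.copy()                # start with f(x) = 0, so r_i = y_i
        self.trees = []
        for _ in range(self.n_trees):
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)          # fit a small tree to the current residuals
            residuals -= self.learning_rate * tree.predict(X)  # update the residuals
            self.trees.append(tree)
        return self

    def predict(self, X):
        # The boosted model is the shrunken sum of all the small trees
        return self.learning_rate * np.sum(
            [tree.predict(X) for tree in self.trees], axis=0
        )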
Intuition
Boosting builds a strong model by repeatedly adding weak trees that correct what the current model still gets wrong.
Warning
Unlike bagging and random forests, boosting can overfit if pushed too far, though in practice it is often fairly resistant when the learning rate is small.