🌲🌳 Random Forest Interactive Demo 🌳🌲

This demo uses the UCI Adult Dataset (also known as the "Census Income" dataset), extracted from the 1994 U.S. Census by Ronny Kohavi and Barry Becker. It is a classic binary classification benchmark where the goal is to predict whether an individual's annual income exceeds $50,000 based on demographic attributes.

For this interactive demonstration, we use a subset of 2,000 training samples and 1,000 test samples. The histogram shows the test accuracy distribution of individual decision trees trained on different bootstrap samples (random samples drawn with replacement).
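For readers who want to reproduce the histogram offline, here is a minimal sketch under stated assumptions: it pulls the OpenML copy of the Adult dataset via scikit-learn's fetch_openml, uses a crude ordinal encoding, and keeps default tree hyperparameters. The demo's actual data pipeline and settings may differ; only the 2,000/1,000 split sizes come from the text above.

```python
# Minimal sketch of the demo setup (assumptions: OpenML copy of the Adult data,
# default hyperparameters; the demo's actual pipeline may differ).
import numpy as np
from sklearn.datasets import fetch_openml
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Load the Adult ("census income") data from OpenML.
X, y = fetch_openml("adult", version=2, as_frame=True, return_X_y=True)

# Crude categorical encoding so plain decision trees can consume the features.
for col in X.select_dtypes(exclude="number").columns:
    X[col] = X[col].astype("category").cat.codes      # missing values become -1
X = X.to_numpy(dtype=float)

# Subset sizes taken from the text: 2,000 training and 1,000 test samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2000, test_size=1000, random_state=0, stratify=y)

rng = np.random.default_rng(0)
n, T = len(X_train), 100

# Test accuracy of each tree trained on its own bootstrap sample
# (these values make up the histogram).
tree_accs = []
for _ in range(T):
    idx = rng.integers(0, n, size=n)                  # draw n indices with replacement
    tree = DecisionTreeClassifier(random_state=0).fit(X_train[idx], y_train.iloc[idx])
    tree_accs.append(tree.score(X_test, y_test))

# A forest of the same size, trained on the same 2,000 samples, for comparison.
forest = RandomForestClassifier(n_estimators=T, random_state=0)
forest_acc = forest.fit(X_train, y_train).score(X_test, y_test)

print(f"individual trees: {np.mean(tree_accs):.3f} +/- {np.std(tree_accs):.3f}")
print(f"random forest:    {forest_acc:.3f}")
```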

[Histogram: test accuracy distribution of the individual trees, with reference markers for a single tree trained without bootstrapping and for the random forest ensemble]

Understanding Random Forests

Random Forests are an ensemble learning method that combines multiple decision trees to produce more robust and accurate predictions. The key ideas are bagging (bootstrap aggregating) and feature randomization.

Bootstrap Sampling

Each tree in the forest is trained on a different bootstrap sample - a random sample drawn with replacement from the original training data. This means each tree sees a slightly different version of the data.

$$P(\text{sample in bootstrap}) = 1 - \left(1 - \frac{1}{n}\right)^n \approx 1 - e^{-1} \approx 63.2\%$$

On average, each bootstrap sample contains about 63.2% of the unique training examples, with some appearing multiple times (the sketch after the list below checks this figure numerically).

  • Creates diversity among trees
  • Reduces overfitting to specific data points
  • Out-of-bag samples can be used for validation
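A quick numeric check of the 63.2% figure needs no dataset at all; here n = 2,000 matches the demo's training-set size, but any n behaves the same:

```python
# Empirical check of the ~63.2% figure: draw bootstrap index sets and count
# how many distinct training examples each one contains.
import numpy as np

rng = np.random.default_rng(0)
n, trials = 2000, 1000           # n matches the demo's training-set size

unique_fracs = []
for _ in range(trials):
    idx = rng.integers(0, n, size=n)              # n draws with replacement
    unique_fracs.append(len(np.unique(idx)) / n)

print(f"mean unique fraction: {np.mean(unique_fracs):.4f}")   # ~0.632
print(f"1 - (1 - 1/n)^n:      {1 - (1 - 1/n) ** n:.4f}")      # ~0.632

# Indices never drawn are that tree's out-of-bag (OOB) set, a free validation set.
oob_frac = 1 - len(np.unique(idx)) / n
print(f"OOB fraction (last draw): {oob_frac:.4f}")            # ~0.368
```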

Ensemble Prediction

The final prediction is made by aggregating predictions from all trees. For classification, this is typically done through majority voting.

$$\hat{y} = \text{mode}\left(\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T\right)$$

Where \(\hat{y}_t\) is the prediction from tree \(t\) and \(T\) is the total number of trees.
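A minimal sketch of this aggregation step for binary labels (the prediction array below is a made-up example, not output from the demo):

```python
# Majority voting: stack per-tree predictions and take the column-wise majority.
import numpy as np

# Shape (T, n_samples): prediction of each of T trees for every test point.
tree_preds = np.array([
    [1, 0, 1, 1],   # tree 1
    [1, 1, 0, 1],   # tree 2
    [0, 0, 1, 1],   # tree 3
])

# mode(y_hat_1, ..., y_hat_T): the class predicted by more than half of the trees.
votes = tree_preds.sum(axis=0)
y_hat = (votes > tree_preds.shape[0] / 2).astype(int)
print(y_hat)        # [1 0 1 1]
```

Note that scikit-learn's RandomForestClassifier aggregates by averaging the trees' predicted class probabilities (soft voting) rather than by a hard vote, though the two rules usually agree when the trees are confident.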

Why it works: If individual trees have >50% accuracy and make independent errors, the ensemble error decreases exponentially with more trees (Condorcet's Jury Theorem); the calculation after this list makes the effect concrete.
  • Reduces variance without increasing bias
  • More stable predictions than single trees
  • Robust to outliers and noise
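The jury-theorem claim can be checked with a short calculation. The 75% per-tree accuracy below is an arbitrary illustration, and independence is an idealization that real trees only approximate:

```python
# P(majority vote is wrong) for T independent voters, each correct with probability p.
from math import comb

def majority_error(p: float, T: int) -> float:
    # The vote fails when more than half of the T trees are wrong (odd T avoids ties).
    return sum(comb(T, k) * (1 - p) ** k * p ** (T - k)
               for k in range(T // 2 + 1, T + 1))

p = 0.75
for T in (1, 11, 101):
    print(T, majority_error(p, T))
# 1   -> 0.25
# 11  -> ~0.034
# 101 -> ~3e-8   (errors shrink roughly exponentially in T, *if* they are independent)
```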

Why Random Forests Outperform Single Trees

The histogram above demonstrates a key insight: individual decision trees trained on bootstrap samples vary in performance, but their ensemble (the random forest) typically outperforms most of them. This happens because (see the simulation after this list):

  • Variance Reduction: Averaging predictions from many trees reduces the variance inherent in single decision trees
  • Error Cancellation: When trees make different errors, aggregation tends to cancel out individual mistakes
  • Robustness: The ensemble is less sensitive to the specific training data used
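The error-cancellation effect can be seen even without real data. The toy simulation below is purely illustrative (the 70% per-tree accuracy, 25 voters, and the copy-a-shared-outcome correlation mechanism are all assumptions, not how the demo's trees behave): each simulated "tree" is right about 70% of the time, but because their mistakes only partly overlap, the majority vote is right noticeably more often.

```python
# Toy simulation of error cancellation among partially correlated voters.
import numpy as np

rng = np.random.default_rng(0)
n_points, T, p, w = 20_000, 25, 0.70, 0.3   # w: chance a tree copies the shared outcome

shared = rng.random(n_points) < p                     # outcome shared across all trees
correct = np.empty((T, n_points), dtype=bool)
for t in range(T):
    own = rng.random(n_points) < p                    # tree-specific outcome
    copy = rng.random(n_points) < w
    correct[t] = np.where(copy, shared, own)          # each tree is right w.p. p overall

individual_acc = correct.mean(axis=1)                 # per-"tree" accuracy, all ~0.70
ensemble_acc = (correct.sum(axis=0) > T / 2).mean()   # majority-vote accuracy

print(f"mean individual accuracy: {individual_acc.mean():.3f}")   # ~0.70
print(f"majority-vote accuracy:   {ensemble_acc:.3f}")            # ~0.84 with these settings
```

Setting w = 0 (fully independent errors) pushes the majority vote to about 0.98, while w = 1 (identical trees) collapses it back to 0.70; this is exactly the role the correlation \(\rho\) plays in the variance formula below.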

The variance of the ensemble's averaged prediction can be expressed as (Breiman, 2001):

$$\text{Var}(\bar{X}) = \frac{\sigma^2}{T} + \frac{T-1}{T}\rho\sigma^2$$

Where \(\sigma^2\) is the variance of individual trees, \(T\) is the number of trees, and \(\rho\) is the average correlation between trees. Bootstrap sampling and feature randomization help reduce \(\rho\), making the ensemble more effective.
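Plugging illustrative numbers into the formula (here \(\sigma^2 = 1\); the \(\rho\) values are arbitrary) shows why the correlation term dominates once T is moderately large: the first term vanishes as T grows, but the variance never drops below \(\rho\sigma^2\).

```python
# Numeric check of Breiman's decomposition: Var = sigma^2/T + (T-1)/T * rho * sigma^2.
def ensemble_variance(T: int, rho: float, sigma2: float = 1.0) -> float:
    return sigma2 / T + (T - 1) / T * rho * sigma2

for rho in (0.0, 0.3, 0.8):
    print(f"rho = {rho}:", [round(ensemble_variance(T, rho), 3) for T in (1, 10, 100, 1000)])
# rho = 0.0: [1.0, 0.1, 0.01, 0.001]     -> variance keeps shrinking as T grows
# rho = 0.3: [1.0, 0.37, 0.307, 0.301]   -> floor at rho * sigma^2 = 0.3
# rho = 0.8: [1.0, 0.82, 0.802, 0.8]     -> little is gained beyond a handful of trees
```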