This demo uses the UCI Adult Dataset (also known as the "Census Income" dataset), extracted from the 1994 U.S. Census by Ronny Kohavi and Barry Becker. It is a classic binary classification benchmark where the goal is to predict whether an individual's annual income exceeds $50,000 based on demographic attributes.
For this interactive demonstration, we use a subset of 2,000 training samples and 1,000 test samples. The histogram shows the test accuracy distribution of individual decision trees trained on different bootstrap samples (random samples drawn with replacement).
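The sketch below outlines how such a histogram can be produced. It assumes the Adult data is fetched from OpenML and one-hot encodes the categorical columns; the 2,000/1,000 split sizes mirror the demo, while the preprocessing details and the number of trees are assumptions of this sketch rather than the demo's exact setup.

```python
# Train many decision trees on bootstrap samples of a small Adult subset
# and collect their test accuracies (the quantity shown in the histogram).
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Version 2 of "adult" on OpenML is assumed here; requires network access.
X, y = fetch_openml("adult", version=2, as_frame=True, return_X_y=True)
X = pd.get_dummies(X)  # simple one-hot encoding of categorical columns

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2000, test_size=1000, stratify=y, random_state=0
)

rng = np.random.default_rng(0)
n = len(X_train)
accuracies = []
for _ in range(100):                  # 100 bootstrap-trained trees (assumed count)
    idx = rng.integers(0, n, size=n)  # bootstrap sample: draw n rows with replacement
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train.iloc[idx], y_train.iloc[idx])
    accuracies.append(tree.score(X_test, y_test))

print(f"mean tree accuracy: {np.mean(accuracies):.3f}, "
      f"range: {np.min(accuracies):.3f}-{np.max(accuracies):.3f}")
```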
Random Forests are an ensemble learning method that combines multiple decision trees to produce more robust and accurate predictions. The key ideas are bagging (bootstrap aggregating) and feature randomization.
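In scikit-learn terms, both ideas map directly onto RandomForestClassifier parameters. A minimal sketch on a synthetic stand-in dataset (not the Adult data):

```python
# bootstrap=True turns on bagging; max_features controls feature randomization.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # T: number of trees in the ensemble
    bootstrap=True,       # each tree trained on a bootstrap sample
    max_features="sqrt",  # random subset of features considered at each split
    random_state=0,
).fit(X, y)
```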
Each tree in the forest is trained on a different bootstrap sample: a random sample of the same size as the training set, drawn from it with replacement. This means each tree sees a slightly different version of the data.
On average, each bootstrap sample contains about 63.2% of the unique training examples, with some appearing multiple times: the probability that any particular example is never drawn in \(n\) draws is \((1 - 1/n)^n \approx e^{-1} \approx 36.8\%\), so roughly 63.2% of the examples appear at least once.
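A quick simulation (a sketch independent of the demo code) confirms the 63.2% figure empirically:

```python
# Estimate the average fraction of unique training rows in a bootstrap sample.
import numpy as np

rng = np.random.default_rng(0)
n = 2000          # training-set size, matching the demo subset
trials = 200
fractions = [np.unique(rng.integers(0, n, size=n)).size / n for _ in range(trials)]

print(f"simulated unique fraction: {np.mean(fractions):.3f}")  # about 0.632
print(f"theoretical limit 1 - 1/e: {1 - np.exp(-1):.3f}")      # 0.632
```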
The final prediction is made by aggregating the predictions of all trees. For classification, this is typically done through majority voting:
\[
\hat{y} = \operatorname{mode}\{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T\}
\]
Where \(\hat{y}_t\) is the prediction from tree \(t\) and \(T\) is the total number of trees.
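As a sketch of the aggregation step, assuming the per-tree predictions have already been collected as integer class labels (the vote matrix below is hypothetical):

```python
# Majority vote across trees for each test point.
import numpy as np

# votes[t, i] = class predicted by tree t for test point i (hypothetical values)
votes = np.array([
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
])  # T = 3 trees, 3 test points

def majority_vote(votes):
    """Return the most frequent class label in each column (test point)."""
    return np.array([np.bincount(col).argmax() for col in votes.T])

print(majority_vote(votes))  # [0 1 0]
```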
The histogram above demonstrates a key insight: individual decision trees trained on bootstrap samples vary in performance, yet their ensemble (the random forest) typically outperforms most of the individual trees. This happens because the trees make partially uncorrelated errors, so aggregating their votes cancels much of the tree-to-tree variance.
The variance of the ensemble prediction can be expressed as (Breiman, 2001):
\[
\operatorname{Var}_{\text{ensemble}} = \rho\,\sigma^2 + \frac{1 - \rho}{T}\,\sigma^2
\]
Where \(\sigma^2\) is the variance of an individual tree, \(T\) is the number of trees, and \(\rho\) is the average correlation between the trees' predictions. As \(T\) grows, the second term shrinks toward zero, leaving a floor of \(\rho\sigma^2\); bootstrap sampling and feature randomization help reduce \(\rho\), which is what makes the ensemble effective.
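To get a feel for the formula, here is a small numeric sketch; the values of \(\sigma^2\) and \(\rho\) are illustrative assumptions, not quantities measured from the demo:

```python
# Evaluate rho*sigma^2 + (1 - rho)/T * sigma^2 for a few settings.
sigma2 = 1.0  # variance of a single tree (assumed, for illustration)

for rho in (0.0, 0.3, 0.7):
    for T in (1, 10, 100):
        var = rho * sigma2 + (1 - rho) / T * sigma2
        print(f"rho={rho:.1f}  T={T:3d}  ensemble variance={var:.3f}")

# Adding trees drives the (1 - rho)/T term toward zero, but the floor rho*sigma^2
# remains; only decorrelating the trees (smaller rho) lowers that floor.
```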