This demo uses the UCI Adult Dataset (also known as the "Census Income" dataset), extracted from the 1994 U.S. Census by Ronny Kohavi and Barry Becker. It is a classic binary classification benchmark where the goal is to predict whether an individual's annual income exceeds $50,000 based on demographic attributes.
For this interactive demonstration, we use a subset of 2,000 training samples and 1,000 test samples. The histogram shows the test accuracy distribution of individual decision trees trained on different bootstrap samples (random samples drawn with replacement).
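The sketch below outlines how such a histogram can be produced. It assumes the Adult data is fetched from OpenML and one-hot encodes the categorical columns; the 2,000/1,000 split sizes mirror the demo, while the preprocessing details and the number of trees are assumptions of this sketch rather than the demo's exact setup.

```python
# Train many decision trees on bootstrap samples of a small Adult subset
# and collect their test accuracies (the quantity shown in the histogram).
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Version 2 of "adult" on OpenML is assumed here; requires network access.
X, y = fetch_openml("adult", version=2, as_frame=True, return_X_y=True)
X = pd.get_dummies(X)  # simple one-hot encoding of categorical columns

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2000, test_size=1000, stratify=y, random_state=0
)

rng = np.random.default_rng(0)
n = len(X_train)
accuracies = []
for _ in range(100):                  # 100 bootstrap-trained trees (assumed count)
    idx = rng.integers(0, n, size=n)  # bootstrap sample: draw n rows with replacement
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X_train.iloc[idx], y_train.iloc[idx])
    accuracies.append(tree.score(X_test, y_test))

print(f"mean tree accuracy: {np.mean(accuracies):.3f}, "
      f"range: {np.min(accuracies):.3f}-{np.max(accuracies):.3f}")
```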
Random Forests are an ensemble learning method that combines multiple decision trees to produce more robust and accurate predictions. The key ideas are bagging (bootstrap aggregating) and feature randomization.
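In scikit-learn terms, both ideas map directly onto RandomForestClassifier parameters. A minimal sketch on a synthetic stand-in dataset (not the Adult data):

```python
# bootstrap=True turns on bagging; max_features controls feature randomization.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=3000, n_features=20, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,     # T: number of trees in the ensemble
    bootstrap=True,       # each tree trained on a bootstrap sample
    max_features="sqrt",  # random subset of features considered at each split
    random_state=0,
).fit(X, y)
```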
Each tree in the forest is trained on a different bootstrap sample: a random sample of the same size as the training set, drawn from it with replacement. This means each tree sees a slightly different version of the data.
On average, each bootstrap sample contains about 63.2% of the unique training examples, with some appearing multiple times: the probability that any particular example is never drawn in \(n\) draws is \((1 - 1/n)^n \approx e^{-1} \approx 36.8\%\), so roughly 63.2% of the examples appear at least once.
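A quick simulation (a sketch independent of the demo code) confirms the 63.2% figure empirically:

```python
# Estimate the average fraction of unique training rows in a bootstrap sample.
import numpy as np

rng = np.random.default_rng(0)
n = 2000          # training-set size, matching the demo subset
trials = 200
fractions = [np.unique(rng.integers(0, n, size=n)).size / n for _ in range(trials)]

print(f"simulated unique fraction: {np.mean(fractions):.3f}")  # about 0.632
print(f"theoretical limit 1 - 1/e: {1 - np.exp(-1):.3f}")      # 0.632
```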
The final prediction is made by aggregating the predictions of all trees. For classification, this is typically done through majority voting:
\[
\hat{y} = \operatorname{mode}\{\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_T\}
\]
Where \(\hat{y}_t\) is the prediction from tree \(t\) and \(T\) is the total number of trees.
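As a sketch of the aggregation step, assuming the per-tree predictions have already been collected as integer class labels (the vote matrix below is hypothetical):

```python
# Majority vote across trees for each test point.
import numpy as np

# votes[t, i] = class predicted by tree t for test point i (hypothetical values)
votes = np.array([
    [0, 1, 1],
    [0, 1, 0],
    [1, 1, 0],
])  # T = 3 trees, 3 test points

def majority_vote(votes):
    """Return the most frequent class label in each column (test point)."""
    return np.array([np.bincount(col).argmax() for col in votes.T])

print(majority_vote(votes))  # [0 1 0]
```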
The histogram above demonstrates a key insight: individual decision trees trained on bootstrap samples vary in performance, yet their ensemble (the random forest) typically outperforms most of the individual trees. This happens because the trees make partially uncorrelated errors, so aggregating their votes cancels much of the tree-to-tree variance.
The variance of the ensemble prediction can be expressed as (Breiman, 2001):
\[
\operatorname{Var}_{\text{ensemble}} = \rho\,\sigma^2 + \frac{1 - \rho}{T}\,\sigma^2
\]
Where \(\sigma^2\) is the variance of an individual tree, \(T\) is the number of trees, and \(\rho\) is the average correlation between the trees' predictions. As \(T\) grows, the second term shrinks toward zero, leaving a floor of \(\rho\sigma^2\); bootstrap sampling and feature randomization help reduce \(\rho\), which is what makes the ensemble effective.
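To get a feel for the formula, here is a small numeric sketch; the values of \(\sigma^2\) and \(\rho\) are illustrative assumptions, not quantities measured from the demo:

```python
# Evaluate rho*sigma^2 + (1 - rho)/T * sigma^2 for a few settings.
sigma2 = 1.0  # variance of a single tree (assumed, for illustration)

for rho in (0.0, 0.3, 0.7):
    for T in (1, 10, 100):
        var = rho * sigma2 + (1 - rho) / T * sigma2
        print(f"rho={rho:.1f}  T={T:3d}  ensemble variance={var:.3f}")

# Adding trees drives the (1 - rho)/T term toward zero, but the floor rho*sigma^2
# remains; only decorrelating the trees (smaller rho) lowers that floor.
```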