This demo uses the UCI Adult Dataset (also known as the "Census Income" dataset), extracted from the 1994 U.S. Census by Ronny Kohavi and Barry Becker. It is a classic binary classification benchmark where the goal is to predict whether an individual's annual income exceeds $50,000 based on demographic attributes.
For this interactive demonstration, we use a subset of 2,000 training samples and 1,000 test samples, with a selection of features including age, education level, hours worked per week, capital gains, marital status, occupation, and sex.
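For readers who want to reproduce a comparable setup outside the demo, here is a minimal sketch using scikit-learn's OpenML mirror of the dataset. The demo's exact sampling and preprocessing may differ, and the column names ("education-num", "marital-status", etc.) assume OpenML's version 2 schema.

```python
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Fetch the Adult ("Census Income") dataset from OpenML as a DataFrame.
adult = fetch_openml("adult", version=2, as_frame=True)
X, y = adult.data, adult.target  # target values: "<=50K" / ">50K"

# Restrict to features similar to those in the demo
# (column names assume OpenML's version 2 schema).
features = ["age", "education-num", "hours-per-week", "capital-gain",
            "marital-status", "occupation", "sex"]
X = X[features]

# Draw 2,000 training samples, then take 1,000 of the rest as a test set,
# mirroring the demo's subset sizes.
X_train, X_rest, y_train, y_rest = train_test_split(
    X, y, train_size=2000, stratify=y, random_state=0)
X_test, y_test = X_rest.iloc[:1000], y_rest.iloc[:1000]
```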
Decision trees use impurity measures to determine the best feature and threshold for splitting data. The goal is to create child nodes that are as "pure" as possible (containing mostly one class).
Measures the probability of incorrectly classifying a randomly chosen element if it were randomly labeled according to the distribution of labels in the node:

\[ \text{Gini}(D) = \sum_{i=1}^{C} p_i (1 - p_i) = 1 - \sum_{i=1}^{C} p_i^2 \]

Where \(p_i\) is the proportion of samples belonging to class \(i\) in dataset \(D\), and \(C\) is the number of classes.
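As a concrete illustration (a hypothetical helper, not the demo's internal code), Gini impurity can be computed directly from a node's labels:

```python
import numpy as np

def gini(labels):
    """Gini impurity of a node: 1 - sum of p_i^2 over classes."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

# A node with 8 "<=50K" and 2 ">50K" samples:
# p = [0.8, 0.2], so Gini = 1 - (0.64 + 0.04) = 0.32
print(gini(["<=50K"] * 8 + [">50K"] * 2))  # 0.32
```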
Based on information theory, entropy measures the average amount of information needed to identify the class of an element:

\[ H(D) = -\sum_{i=1}^{C} p_i \log_2 p_i \]

Information gain is the reduction in entropy after a split. The information gain from splitting on feature \(A\) is:

\[ IG(D, A) = H(D) - \sum_{v \in \text{Values}(A)} \frac{|D_v|}{|D|}\, H(D_v) \]

Where \(D_v\) is the subset of samples for which feature \(A\) takes value \(v\).
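The two quantities translate directly into code. This sketch (hypothetical helpers, assuming NumPy) computes entropy in bits and the gain of a candidate split:

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a node in bits: -sum of p_i * log2(p_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# Splitting a 50/50 node into two pure children yields the maximal gain of 1 bit.
parent = [0] * 5 + [1] * 5
print(information_gain(parent, [[0] * 5, [1] * 5]))  # 1.0
```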
In practice, Gini and entropy often produce similar trees. Key considerations:

- Gini is slightly cheaper to compute, since it avoids the logarithm.
- Entropy penalizes mixed nodes somewhat more strongly and can favor more balanced splits.
- Gini is the default in CART-style implementations such as scikit-learn; entropy comes from the ID3/C4.5 family.
- For most datasets, the choice of criterion matters far less than depth limits and other regularization.
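Since both criteria are exposed by scikit-learn's DecisionTreeClassifier, a quick empirical comparison is easy to run. This sketch uses synthetic stand-in data rather than the demo's Adult subset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; swap in your own features and labels.
X, y = make_classification(n_samples=2000, n_features=7, random_state=0)

for criterion in ("gini", "entropy"):
    tree = DecisionTreeClassifier(criterion=criterion, max_depth=5, random_state=0)
    score = cross_val_score(tree, X, y, cv=5).mean()
    print(f"{criterion}: mean CV accuracy = {score:.3f}")
```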
The weighted impurity decrease for a split is calculated as:

\[ \Delta I = I(D) - \frac{n_L}{n} I(D_L) - \frac{n_R}{n} I(D_R) \]

Where \(D_L\) and \(D_R\) are the left and right child datasets, and \(n_L\), \(n_R\), \(n\) are their respective sample counts.
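Putting it together, a hypothetical helper (repeating the `gini` function from the earlier sketch so the snippet stands alone) can score a candidate split by its weighted impurity decrease; the split with the largest decrease is the one a greedy tree builder would pick:

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def impurity_decrease(parent, left, right, impurity=gini):
    # Delta I = I(D) - (n_L / n) * I(D_L) - (n_R / n) * I(D_R)
    n, n_l, n_r = len(parent), len(left), len(right)
    return (impurity(parent)
            - (n_l / n) * impurity(left)
            - (n_r / n) * impurity(right))

# Example: splitting a 4-vs-2 node into a pure left child and a mixed right child.
parent = [0, 0, 0, 0, 1, 1]
left, right = [0, 0, 0], [0, 1, 1]
print(impurity_decrease(parent, left, right))  # ~0.222
```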