Decision Tree Interactive Demo

This demo uses the UCI Adult Dataset (also known as the "Census Income" dataset), extracted from the 1994 U.S. Census by Ronny Kohavi and Barry Becker. It is a classic binary classification benchmark where the goal is to predict whether an individual's annual income exceeds $50,000 based on demographic attributes.

For this interactive demonstration, we use a subset of 2,000 training samples and 1,000 test samples, with a selection of features including age, education level, hours worked per week, capital gains, marital status, occupation, and sex.
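
The demo's exact data pipeline is not shown here, but a roughly comparable subset can be assembled with scikit-learn. In the sketch below, the OpenML dataset name "adult" and the column names are assumptions; only the 2,000/1,000 sample split and the feature list come from the description above.

```python
# Minimal sketch of assembling a comparable Adult subset (not the demo's actual pipeline).
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

adult = fetch_openml("adult", version=2, as_frame=True)   # UCI Adult via OpenML (assumed source)
features = ["age", "education-num", "hours-per-week", "capital-gain",
            "marital-status", "occupation", "sex"]         # column names may differ by version
X = pd.get_dummies(adult.data[features])   # one-hot encode the categorical columns
y = adult.target                           # ">50K" vs. "<=50K"

# Draw 2,000 training and 1,000 test samples, as in the demo.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=2000, test_size=1000, stratify=y, random_state=0
)
```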

TREE STRUCTURE

[Interactive tree diagram: the root node holds 100% of the samples; node color indicates whether a node predicts >50K or ≤50K, with "Yes" branches to the left and "No" branches to the right.]

Feature Importance

[Feature importance chart; build a tree to populate it.]

BUILD YOUR TREE

[Metrics panel: Accuracy, Precision, Recall, and F1, displayed once you build a tree.]

YOUR TREE STRUCTURE

[Interactive tree diagram for the tree you build, using the same layout and legend as above.]

Feature Distribution

[Distribution chart for a feature selected in your tree; add a split node to see it.]

Understanding Split Criteria

Decision trees use impurity measures to determine the best feature and threshold for splitting data. The goal is to create child nodes that are as "pure" as possible (containing mostly one class).

Gini Impurity

Measures the probability of incorrectly classifying a randomly chosen element if it were randomly labeled according to the distribution of labels in the node.

$$\text{Gini}(D) = 1 - \sum_{i=1}^{C} p_i^2$$

Where \(p_i\) is the proportion of samples belonging to class \(i\) in dataset \(D\), and \(C\) is the number of classes.

Example: For a node with 70% class A and 30% class B:
$$\text{Gini} = 1 - (0.7^2 + 0.3^2) = 1 - (0.49 + 0.09) = 0.42$$
  • Range: 0 (pure) to 0.5 (maximum impurity for binary classification)
  • Computationally efficient (no logarithms)
  • Default in scikit-learn's DecisionTreeClassifier
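
As a quick sanity check of the worked example above, here is a minimal Gini computation in plain Python:

```python
def gini(proportions):
    """Gini impurity for a node, given its class proportions p_i."""
    return 1.0 - sum(p ** 2 for p in proportions)

print(gini([0.7, 0.3]))  # ≈ 0.42, matching the example
```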

Entropy (Information Gain)

Based on information theory, entropy measures the average amount of information needed to identify the class of an element. Information gain is the reduction in entropy after a split.

$$\text{Entropy}(D) = -\sum_{i=1}^{C} p_i \log_2(p_i)$$

The information gain from splitting on feature \(A\) is:

$$\text{Gain}(D, A) = \text{Entropy}(D) - \sum_{v \in \text{values}(A)} \frac{|D_v|}{|D|} \text{Entropy}(D_v)$$

Example: For a node with 70% class A and 30% class B:
$$\text{Entropy} = -(0.7 \log_2 0.7 + 0.3 \log_2 0.3) \approx 0.88 \text{ bits}$$
  • Range: 0 (pure) to \(\log_2(C)\) (for \(C\) classes with equal distribution)
  • Has roots in information theory (Shannon entropy)
  • Used in the classic ID3 and C4.5 algorithms
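
A matching sketch for entropy and information gain; the split below is arbitrary and only meant to exercise the formulas:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Reduction in entropy from splitting `parent` into `left` and `right`."""
    n = len(parent)
    weighted = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - weighted

node = ["A"] * 7 + ["B"] * 3                       # 70% class A, 30% class B
print(round(entropy(node), 2))                     # 0.88 bits, matching the example
print(information_gain(node, node[:5], node[5:]))  # gain for one arbitrary split
```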

Choosing a Criterion

In practice, Gini impurity and entropy often produce similar trees. Key considerations:

  • Speed: Gini is slightly faster (no log computation)
  • Tendency: Entropy may create slightly more balanced trees
  • Multi-class: Both work well, but entropy's range scales with the number of classes
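
In scikit-learn the criterion is a single constructor argument, so the two can be compared directly on the same data. The sketch below assumes the X_train/y_train variables from the earlier loading example; the max_depth value is chosen purely for illustration.

```python
from sklearn.tree import DecisionTreeClassifier

# Assumes X_train, X_test, y_train, y_test from the earlier loading sketch.
for criterion in ("gini", "entropy"):
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=4, random_state=0)
    clf.fit(X_train, y_train)
    print(criterion, round(clf.score(X_test, y_test), 3))
```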

The weighted impurity decrease for a split is calculated as:

$$\Delta \text{Impurity} = \text{Impurity}(D) - \frac{n_L}{n} \text{Impurity}(D_L) - \frac{n_R}{n} \text{Impurity}(D_R)$$

Where \(D_L\) and \(D_R\) are the left and right child datasets, \(n_L\) and \(n_R\) are their sample counts, and \(n\) is the number of samples in the parent node.
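
The same quantity in code; this reuses the gini helper from the earlier sketch, but any impurity function over class proportions could be plugged in:

```python
from collections import Counter

def proportions(labels):
    """Class proportions p_i for a list of labels."""
    n = len(labels)
    return [c / n for c in Counter(labels).values()]

def impurity_decrease(parent, left, right, impurity):
    """Weighted impurity decrease from splitting `parent` into `left` and `right`."""
    n = len(parent)
    return (impurity(proportions(parent))
            - (len(left) / n) * impurity(proportions(left))
            - (len(right) / n) * impurity(proportions(right)))

# 70/30 parent split into a pure left child and a mixed right child.
parent = ["A"] * 7 + ["B"] * 3
print(impurity_decrease(parent, parent[:5], parent[5:], gini))  # assumes gini() from above
```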