One of the most repeated pieces of retail folklore is the claim that a grocery chain discovered, via data mining, that customers who bought diapers on Thursday evenings were also likely to buy beer. The widely told version is that young fathers, sent out for diapers after work, would grab beer for themselves while they were at it — leading the store to place the two products near each other.
Whether or not the anecdote is literally true (it is probably embellished), it's a tidy illustration of the problem: given a log of transactions, which items tend to co-occur, and can we turn those co-occurrences into useful if-then rules?
Association rule learning tackles exactly that problem. It is an unsupervised method — there are no labels — and the output is a set of rules of the form $X \Rightarrow Y$, where $X$ (the antecedent) and $Y$ (the consequent) are disjoint itemsets: baskets that contain $X$ tend also to contain $Y$.
The key questions are (a) which rules show up often enough to matter, and (b) which of them reflect a real association rather than the fact that both items are simply popular.
Before running any algorithm, here is the vocabulary we will use. Everything is defined over a collection $\mathcal{D}$ of transactions, where each transaction $T \subseteq \mathcal{I}$ is a set of items drawn from a universe $\mathcal{I}$.
Below are 20 transactions from a toy grocery store. We have rigged the data so that diapers and beer co-occur often enough to tell the classic story — but the dataset also contains a mix of unrelated items (bread, eggs, milk, cheese, wine, chips, baby_food, formula) so we can generate many competing rules.
(Interactive demo: click items to build an itemset; the transaction table highlights baskets that contain all selected items, and support, confidence, and lift update live. Shift-click marks an item as the consequent of the rule X ⇒ Y; all other selected items form the antecedent.)
Support
The support of an itemset $X$ is the fraction of transactions that contain it: $\mathrm{supp}(X) = |\{T \in \mathcal{D} : X \subseteq T\}| \,/\, |\mathcal{D}|$. Support measures how often a pattern occurs at all.

Confidence
The confidence of a rule $X \Rightarrow Y$ is the conditional frequency of $Y$ among baskets containing $X$: $\mathrm{conf}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y) \,/\, \mathrm{supp}(X)$.

Lift
Lift compares that confidence to the baseline popularity of $Y$: $\mathrm{lift}(X \Rightarrow Y) = \mathrm{conf}(X \Rightarrow Y) \,/\, \mathrm{supp}(Y)$. Lift greater than 1 means $X$ and $Y$ co-occur more often than independence would predict.
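All three metrics are a few lines of code. The sketch below is a minimal Python implementation; note that the `baskets` list here is a small stand-in sample, not the article's 20-transaction dataset.

```python
# Minimal support / confidence / lift calculator.
# NOTE: `baskets` is a stand-in toy sample, NOT the article's 20-basket dataset.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer", "baby_food"},
    {"milk", "bread", "eggs"},
    {"diapers", "formula", "milk"},
    {"beer", "chips", "wine"},
    {"diapers", "beer", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """conf(X => Y) = supp(X ∪ Y) / supp(X)."""
    joint = set(antecedent) | set(consequent)
    return support(joint, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """lift(X => Y) = conf(X => Y) / supp(Y); > 1 means positive association."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

print(support({"diapers"}, baskets))               # 4/6 ≈ 0.667
print(confidence({"diapers"}, {"beer"}, baskets))  # 0.75
print(lift({"diapers"}, {"beer"}, baskets))        # 1.125
```

Lift above 1 even in this tiny sample: diapers-buyers reach for beer more often than the average basket does.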
Suppose 95% of baskets contain milk. Then almost any rule like {X} ⇒ {milk} will have high confidence — just because milk is ubiquitous, not because of any real association with $X$.
Lift normalizes by the baseline popularity of $Y$. If the lift of {X} ⇒ {milk} is close to 1, we learn that $X$ actually tells us nothing beyond the marginal rate of milk. For the beer-and-diapers story we want both high confidence and lift > 1.
Brute force would evaluate every subset of items — $2^{|\mathcal{I}|}$ itemsets — hopeless for any real dataset. Apriori (Agrawal & Srikant, 1994) prunes this space using one beautiful observation: every subset of a frequent itemset is itself frequent (equivalently, if an itemset is infrequent, so is every superset of it).
That means: to find frequent $k$-itemsets, we only need to consider candidates whose $(k{-}1)$-subsets are already known to be frequent. We build up from size 1 upward, pruning aggressively at each level.
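That level-wise loop fits in a short function. A sketch, assuming baskets are represented as Python sets (`baskets` is again a stand-in sample, not the article's dataset):

```python
from itertools import chain, combinations

def apriori(baskets, min_support):
    """Level-wise frequent-itemset mining using the Apriori property."""
    n = len(baskets)
    def supp(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent individual items.
    items = set(chain.from_iterable(baskets))
    frequent = {frozenset([i]) for i in items
                if supp(frozenset([i])) >= min_support}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Join: union pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune: keep a candidate only if ALL its (k-1)-subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Count: one pass over the data per level.
        frequent = {c for c in candidates if supp(c) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

# Stand-in sample, NOT the article's 20-basket dataset.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer", "baby_food"},
    {"milk", "bread", "eggs"},
    {"diapers", "formula", "milk"},
    {"beer", "chips", "wine"},
    {"diapers", "beer", "milk"},
]
print(sorted(sorted(s) for s in apriori(baskets, min_support=0.5)))
# [['beer'], ['beer', 'diapers'], ['diapers'], ['milk']]
```

Notice that `{diapers, milk}` is never even counted at level 2 once its support falls short — and no 3-itemset is generated at all, because only one 2-itemset survived.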
Below you can run it on our 20-basket dataset. Adjust the thresholds, then step through each phase. Pruned candidates are highlighted so you can see the Apriori property in action.
| Rule (antecedent ⇒ consequent) | support | confidence | lift |
|---|---|---|---|
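Frequent itemsets alone are not rules. A common follow-up step — sketched here, under the same stand-in data assumption — splits each frequent itemset into every antecedent/consequent pair and keeps the splits whose confidence clears a threshold:

```python
from itertools import combinations

def generate_rules(frequent_itemsets, baskets, min_confidence=0.6):
    """Turn frequent itemsets into rules X => Y with conf >= min_confidence."""
    n = len(baskets)
    def supp(itemset):
        return sum(set(itemset) <= b for b in baskets) / n

    rules = []
    for itemset in frequent_itemsets:
        if len(itemset) < 2:
            continue  # a rule needs both sides non-empty
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                consequent = tuple(sorted(set(itemset) - set(antecedent)))
                conf = supp(itemset) / supp(antecedent)
                if conf >= min_confidence:
                    rules.append({"antecedent": antecedent,
                                  "consequent": consequent,
                                  "support": supp(itemset),
                                  "confidence": conf,
                                  "lift": conf / supp(consequent)})
    return rules

# Stand-in sample, NOT the article's 20-basket dataset.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer", "baby_food"},
    {"milk", "bread", "eggs"},
    {"diapers", "formula", "milk"},
    {"beer", "chips", "wine"},
    {"diapers", "beer", "milk"},
]
for rule in generate_rules([frozenset({"diapers", "beer"})], baskets):
    print(rule["antecedent"], "=>", rule["consequent"],
          f"conf={rule['confidence']:.2f} lift={rule['lift']:.3f}")
```

Each frequent pair yields two candidate rules (one per direction); their lifts are equal by symmetry of the formula, but their confidences generally differ.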
Rare items are invisible
A hard support threshold hides any rule involving an uncommon item, no matter how strong the association. Domain-specific fixes include multi-level support, targeted item selection, or alternative metrics (conviction, leverage).
Many rules aren't causal
High lift means co-occurrence beyond chance — not that buying $X$ causes buying $Y$. The famous ice-cream / drowning correlation has high lift but no causal link; the hidden cause is summer.
Multiple testing
Apriori can emit hundreds of thousands of rules on a real dataset. Many will look impressive by chance alone. In practice you filter by lift, apply minimum-length constraints, drop rules whose consequent is too common, and review them with a domain expert.
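Those filters are mechanical to apply once each rule carries its metrics. A sketch, with invented rule values purely for illustration (thresholds here are arbitrary, not recommendations):

```python
def filter_rules(rules, min_lift=1.2, min_items=2, max_consequent_support=0.5):
    """Keep rules with meaningful lift, enough items, and a consequent
    that isn't near-universal in the data."""
    return [r for r in rules
            if r["lift"] >= min_lift
            and len(r["antecedent"]) + len(r["consequent"]) >= min_items
            and r["consequent_support"] <= max_consequent_support]

# Hypothetical mined rules; all numbers invented for illustration.
rules = [
    {"antecedent": ("diapers",), "consequent": ("beer",),
     "lift": 1.50, "consequent_support": 0.30},
    {"antecedent": ("eggs",), "consequent": ("milk",),
     "lift": 1.02, "consequent_support": 0.95},  # milk is ubiquitous: dropped
]
print(filter_rules(rules))  # only the diapers => beer rule survives
```

Automated filtering shrinks the pile; the surviving rules still deserve a domain expert's eye before anyone rearranges the shelves.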
Scaling
Apriori makes one full data pass per level of $k$. For dense datasets or long baskets, more modern algorithms like FP-Growth (frequent-pattern tree) and ECLAT (vertical TID-list intersections) can be much faster. The metrics and rule-generation logic are identical.