One of the most repeated pieces of retail folklore is the claim that a grocery chain discovered, via data mining, that customers who bought diapers on Thursday evenings were also likely to buy beer. The widely told version is that young fathers, sent out for diapers after work, would grab beer for themselves while they were at it — leading the store to place the two products near each other.
Whether or not the anecdote is literally true (it is probably embellished), it's a tidy illustration of the problem: given a log of transactions, which items tend to co-occur, and can we turn those co-occurrences into useful if-then rules?
Association rule learning tackles exactly that problem. It is an unsupervised method — there are no labels — and the output is a set of rules of the form $X \Rightarrow Y$, where $X$ (the antecedent) and $Y$ (the consequent) are disjoint itemsets: baskets that contain $X$ tend also to contain $Y$.
The key questions are (a) which rules show up often enough to matter, and (b) which of them reflect a real association rather than the fact that both items are simply popular.
Before running any algorithm, here is the vocabulary we will use. Everything is defined over a collection $\mathcal{D}$ of transactions, where each transaction $T \subseteq \mathcal{I}$ is a set of items drawn from a universe $\mathcal{I}$.
Below are 20 transactions from a toy grocery store. We have rigged the data so that diapers and beer co-occur often enough to tell the classic story — but the dataset also contains a mix of unrelated items (bread, eggs, milk, cheese, wine, chips, baby_food, formula) so we can generate many competing rules.
(Interactive demo: click items to build an itemset; the transaction table highlights baskets that contain all selected items, and support, confidence, and lift update live. Shift-click marks an item as the consequent of the rule X ⇒ Y; all other selected items form the antecedent.)
Support
The support of an itemset $X$ is the fraction of transactions that contain it: $\mathrm{supp}(X) = |\{T \in \mathcal{D} : X \subseteq T\}| \,/\, |\mathcal{D}|$. Support measures how often a pattern occurs at all.

Confidence
The confidence of a rule $X \Rightarrow Y$ is the conditional frequency of $Y$ among baskets containing $X$: $\mathrm{conf}(X \Rightarrow Y) = \mathrm{supp}(X \cup Y) \,/\, \mathrm{supp}(X)$.

Lift
Lift compares that confidence to the baseline popularity of $Y$: $\mathrm{lift}(X \Rightarrow Y) = \mathrm{conf}(X \Rightarrow Y) \,/\, \mathrm{supp}(Y)$. Lift greater than 1 means $X$ and $Y$ co-occur more often than independence would predict.
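All three metrics are a few lines of code. The sketch below is a minimal Python implementation; note that the `baskets` list here is a small stand-in sample, not the article's 20-transaction dataset.

```python
# Minimal support / confidence / lift calculator.
# NOTE: `baskets` is a stand-in toy sample, NOT the article's 20-basket dataset.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer", "baby_food"},
    {"milk", "bread", "eggs"},
    {"diapers", "formula", "milk"},
    {"beer", "chips", "wine"},
    {"diapers", "beer", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """conf(X => Y) = supp(X ∪ Y) / supp(X)."""
    joint = set(antecedent) | set(consequent)
    return support(joint, baskets) / support(antecedent, baskets)

def lift(antecedent, consequent, baskets):
    """lift(X => Y) = conf(X => Y) / supp(Y); > 1 means positive association."""
    return confidence(antecedent, consequent, baskets) / support(consequent, baskets)

print(support({"diapers"}, baskets))               # 4/6 ≈ 0.667
print(confidence({"diapers"}, {"beer"}, baskets))  # 0.75
print(lift({"diapers"}, {"beer"}, baskets))        # 1.125
```

Lift above 1 even in this tiny sample: diapers-buyers reach for beer more often than the average basket does.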
Suppose 95% of baskets contain milk. Then almost any rule like {X} ⇒ {milk} will have high confidence — just because milk is ubiquitous, not because of any real association with $X$.
Lift normalizes by the baseline popularity of $Y$. If the lift of {X} ⇒ {milk} is close to 1, we learn that $X$ actually tells us nothing beyond the marginal rate of milk. For the beer-and-diapers story we want both high confidence and lift > 1.
Brute force would evaluate every subset of items — $2^{|\mathcal{I}|}$ itemsets — hopeless for any real dataset. Apriori (Agrawal & Srikant, 1994) prunes this space using one beautiful observation: every subset of a frequent itemset is itself frequent (equivalently, if an itemset is infrequent, so is every superset of it).
That means: to find frequent $k$-itemsets, we only need to consider candidates whose $(k{-}1)$-subsets are already known to be frequent. We build up from size 1 upward, pruning aggressively at each level.
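That level-wise loop fits in a short function. A sketch, assuming baskets are represented as Python sets (`baskets` is again a stand-in sample, not the article's dataset):

```python
from itertools import chain, combinations

def apriori(baskets, min_support):
    """Level-wise frequent-itemset mining using the Apriori property."""
    n = len(baskets)
    def supp(itemset):
        return sum(itemset <= b for b in baskets) / n

    # Level 1: frequent individual items.
    items = set(chain.from_iterable(baskets))
    frequent = {frozenset([i]) for i in items
                if supp(frozenset([i])) >= min_support}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Join: union pairs of frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune: keep a candidate only if ALL its (k-1)-subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        # Count: one pass over the data per level.
        frequent = {c for c in candidates if supp(c) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

# Stand-in sample, NOT the article's 20-basket dataset.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer", "baby_food"},
    {"milk", "bread", "eggs"},
    {"diapers", "formula", "milk"},
    {"beer", "chips", "wine"},
    {"diapers", "beer", "milk"},
]
print(sorted(sorted(s) for s in apriori(baskets, min_support=0.5)))
# [['beer'], ['beer', 'diapers'], ['diapers'], ['milk']]
```

Notice that `{diapers, milk}` is never even counted at level 2 once its support falls short — and no 3-itemset is generated at all, because only one 2-itemset survived.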
Below you can run it on our 20-basket dataset. Adjust the thresholds, then step through each phase. Pruned candidates are highlighted so you can see the Apriori property in action.
| Rule (antecedent ⇒ consequent) | support | confidence | lift |
|---|---|---|---|
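Frequent itemsets alone are not rules. A common follow-up step — sketched here, under the same stand-in data assumption — splits each frequent itemset into every antecedent/consequent pair and keeps the splits whose confidence clears a threshold:

```python
from itertools import combinations

def generate_rules(frequent_itemsets, baskets, min_confidence=0.6):
    """Turn frequent itemsets into rules X => Y with conf >= min_confidence."""
    n = len(baskets)
    def supp(itemset):
        return sum(set(itemset) <= b for b in baskets) / n

    rules = []
    for itemset in frequent_itemsets:
        if len(itemset) < 2:
            continue  # a rule needs both sides non-empty
        for r in range(1, len(itemset)):
            for antecedent in combinations(sorted(itemset), r):
                consequent = tuple(sorted(set(itemset) - set(antecedent)))
                conf = supp(itemset) / supp(antecedent)
                if conf >= min_confidence:
                    rules.append({"antecedent": antecedent,
                                  "consequent": consequent,
                                  "support": supp(itemset),
                                  "confidence": conf,
                                  "lift": conf / supp(consequent)})
    return rules

# Stand-in sample, NOT the article's 20-basket dataset.
baskets = [
    {"diapers", "beer", "chips"},
    {"diapers", "beer", "baby_food"},
    {"milk", "bread", "eggs"},
    {"diapers", "formula", "milk"},
    {"beer", "chips", "wine"},
    {"diapers", "beer", "milk"},
]
for rule in generate_rules([frozenset({"diapers", "beer"})], baskets):
    print(rule["antecedent"], "=>", rule["consequent"],
          f"conf={rule['confidence']:.2f} lift={rule['lift']:.3f}")
```

Each frequent pair yields two candidate rules (one per direction); their lifts are equal by symmetry of the formula, but their confidences generally differ.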
Rare items are invisible
A hard support threshold hides any rule involving an uncommon item, no matter how strong the association. Domain-specific fixes include multi-level support, targeted item selection, or alternative metrics (conviction, leverage).
Many rules aren't causal
High lift means co-occurrence beyond chance — not that buying $X$ causes buying $Y$. The famous ice-cream / drowning correlation has high lift but no causal link; the hidden cause is summer.
Multiple testing
Apriori can emit hundreds of thousands of rules on a real dataset. Many will look impressive by chance alone. In practice you filter by lift, apply minimum-length constraints, drop rules whose consequent is too common, and review them with a domain expert.
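Those filters are mechanical to apply once each rule carries its metrics. A sketch, with invented rule values purely for illustration (thresholds here are arbitrary, not recommendations):

```python
def filter_rules(rules, min_lift=1.2, min_items=2, max_consequent_support=0.5):
    """Keep rules with meaningful lift, enough items, and a consequent
    that isn't near-universal in the data."""
    return [r for r in rules
            if r["lift"] >= min_lift
            and len(r["antecedent"]) + len(r["consequent"]) >= min_items
            and r["consequent_support"] <= max_consequent_support]

# Hypothetical mined rules; all numbers invented for illustration.
rules = [
    {"antecedent": ("diapers",), "consequent": ("beer",),
     "lift": 1.50, "consequent_support": 0.30},
    {"antecedent": ("eggs",), "consequent": ("milk",),
     "lift": 1.02, "consequent_support": 0.95},  # milk is ubiquitous: dropped
]
print(filter_rules(rules))  # only the diapers => beer rule survives
```

Automated filtering shrinks the pile; the surviving rules still deserve a domain expert's eye before anyone rearranges the shelves.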
Scaling
Apriori makes one full data pass per level of $k$. For dense datasets or long baskets, more modern algorithms like FP-Growth (frequent-pattern tree) and ECLAT (vertical TID-list intersections) can be much faster. The metrics and rule-generation logic are identical.