Multi-Head Attention Walkthrough

Step through multi-head self-attention on image patches to see exactly how queries, keys, and values are computed, how attention scores become weights, and how multiple heads combine to produce the output. It uses tiny random weights and runs entirely in the browser.
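To make the walkthrough concrete, here is a minimal sketch in plain TypeScript of the computation the visualization steps through. The function and weight names (multiHeadAttention, wq, wk, wv, wo) are illustrative, not the demo's actual identifiers.

```typescript
// Minimal multi-head self-attention sketch (plain TypeScript, no deps).
// Shapes follow the demo: T tokens of dimension D, H heads, headDim = D / H.
type Mat = number[][]; // row-major [rows][cols]

function matmul(a: Mat, b: Mat): Mat {
  return a.map(row =>
    b[0].map((_, j) => row.reduce((s, v, k) => s + v * b[k][j], 0))
  );
}

function softmaxRow(row: number[]): number[] {
  const m = Math.max(...row);               // subtract max for stability
  const exps = row.map(v => Math.exp(v - m));
  const sum = exps.reduce((s, v) => s + v, 0);
  return exps.map(v => v / sum);
}

// x: [T][D] token embeddings; wq/wk/wv/wo: [D][D] projection matrices.
function multiHeadAttention(x: Mat, wq: Mat, wk: Mat, wv: Mat, wo: Mat, H: number): Mat {
  const T = x.length, D = x[0].length, hd = D / H;
  const Q = matmul(x, wq), K = matmul(x, wk), V = matmul(x, wv);

  // Per head: slice out its columns, score = Q·Kᵀ / sqrt(headDim),
  // softmax the scores into weights, then take a weighted sum of values.
  const headOut: Mat = x.map(() => []);
  for (let h = 0; h < H; h++) {
    const slice = (m: Mat) => m.map(r => r.slice(h * hd, (h + 1) * hd));
    const Qh = slice(Q), Kh = slice(K), Vh = slice(V);
    for (let i = 0; i < T; i++) {
      const scores = Kh.map(kRow =>
        Qh[i].reduce((s, v, d) => s + v * kRow[d], 0) / Math.sqrt(hd)
      );
      const weights = softmaxRow(scores);   // attention weights for token i
      for (let d = 0; d < hd; d++) {
        headOut[i].push(weights.reduce((s, w, j) => s + w * Vh[j][d], 0));
      }
    }
  }
  return matmul(headOut, wo); // concatenated heads -> output projection
}
```

Each row of the returned matrix is one token's output; the intermediate Q, K, V, score, and weight matrices this function computes in passing are exactly what the visualization displays step by step.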

See also: Vision Models in Action for a full ViT forward pass on MNIST.

Input

Draw on the canvas, or pick one of the sample patterns.
Drawing is downsampled to 12×12, then split into a 3×3 grid of 4×4 patches (9 patches total).
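A hedged sketch of that patch split (the demo's exact downsampling method isn't specified here, so assume the drawing has already been reduced to a 12×12 grayscale grid):

```typescript
// img: [12][12] values in [0, 1]; returns 9 patches, each flattened to 16 numbers.
// Names are illustrative, not the demo's identifiers.
function patchify(img: number[][], patch = 4): number[][] {
  const grid = img.length / patch;          // 12 / 4 = 3
  const patches: number[][] = [];
  for (let py = 0; py < grid; py++) {
    for (let px = 0; px < grid; px++) {
      const flat: number[] = [];
      for (let y = 0; y < patch; y++) {
        for (let x = 0; x < patch; x++) {
          flat.push(img[py * patch + y][px * patch + x]);
        }
      }
      patches.push(flat);                   // patches in row-major order
    }
  }
  return patches;                           // 9 patches × 16 values
}
```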

Visualization

Draw or pick a sample, then click Run

Welcome

Draw something on the canvas (or pick a sample pattern), then click Run Attention to step through multi-head self-attention.

We use a tiny setup with a fixed 12×12 input image, configurable patch size, embedding dimension, and number of heads. Adjust these in the Architecture panel on the right. All matrices are small enough to inspect cell by cell.

Architecture

Image: 12×12 — 3×3 grid of 4×4 patches (9 patches) — D=8, 2 heads (head_dim=4) — 10 tokens
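Every number in that summary line follows from three knobs. A small sketch of the arithmetic, with illustrative names:

```typescript
const imageSize = 12, patchSize = 4, D = 8, heads = 2;
const gridSide = imageSize / patchSize;     // 3 patches per side
const numPatches = gridSide * gridSide;     // 3 × 3 = 9 patches
const numTokens = numPatches + 1;           // +1 for the CLS token -> 10 tokens
const headDim = D / heads;                  // 8 / 2 = 4
```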

Display

Focus token (for attention steps):
Head visibility:
Regenerates random projection weights and reruns from step 0.
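A sketch of what that reseed might look like, assuming a small seeded generator for reproducibility (the demo's actual RNG and initialization scale are not specified):

```typescript
// Tiny seeded LCG (numerical-recipes constants) so reruns are reproducible.
function makeRng(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (s * 1664525 + 1013904223) >>> 0;
    return s / 2 ** 32;                     // uniform in [0, 1)
  };
}

// Fill a rows×cols matrix with small uniform values (assumed scale).
function randomWeights(rows: number, cols: number, rng: () => number): number[][] {
  return Array.from({ length: rows }, () =>
    Array.from({ length: cols }, () => (rng() - 0.5) * 0.5) // [-0.25, 0.25)
  );
}

// Reseeding builds fresh wq/wk/wv/wo this way and replays the walkthrough
// from step 0.
```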

Color Legend

CLS token
Query (Q)
Key (K)
Value (V)
Head highlight
Heatmaps use the Viridis colormap (dark purple → yellow).
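For reference, a rough approximation of that mapping: the snippet below linearly interpolates three Viridis anchor colors rather than evaluating the colormap's true polynomial fit, so it is an illustration, not the exact palette.

```typescript
// Map t in [0, 1] to an approximate Viridis RGB color by interpolating
// between the endpoints (#440154, #fde725) and a mid teal (~#21918c).
function viridisApprox(t: number): [number, number, number] {
  const stops: [number, number, number][] = [
    [0x44, 0x01, 0x54],  // dark purple (t = 0)
    [0x21, 0x91, 0x8c],  // teal        (t = 0.5)
    [0xfd, 0xe7, 0x25],  // yellow      (t = 1)
  ];
  const x = Math.min(Math.max(t, 0), 1) * (stops.length - 1);
  const i = Math.min(Math.floor(x), stops.length - 2);
  const f = x - i;
  return [0, 1, 2].map(c =>
    Math.round(stops[i][c] + f * (stops[i + 1][c] - stops[i][c]))
  ) as [number, number, number];
}
```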