Multi-Head Attention Walkthrough

Step through multi-head self-attention on image patches to see exactly how queries, keys, and values are computed, how attention scores become weights, and how multiple heads combine to produce the output. The demo uses tiny random weights and runs entirely in the browser.
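The pipeline the demo steps through can be sketched in a few lines of NumPy. This is a minimal illustration, not the demo's actual code; the sizes (17 tokens = 16 patches + CLS, embedding dim 32, 4 heads) and the random projection matrices are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy sizes: 16 patch tokens + 1 CLS token, embedding dim 32, 4 heads.
n_tokens, d_model, n_heads = 17, 32, 4
d_head = d_model // n_heads

x = rng.normal(size=(n_tokens, d_model))  # token embeddings
W_q, W_k, W_v, W_o = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(4))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def split_heads(t):
    # (tokens, d_model) -> (heads, tokens, d_head)
    return t.reshape(n_tokens, n_heads, d_head).transpose(1, 0, 2)

# 1) Project the embeddings to queries, keys, and values, split per head.
Q, K, V = split_heads(x @ W_q), split_heads(x @ W_k), split_heads(x @ W_v)

# 2) Scaled dot-product scores, turned into weights by a row-wise softmax.
scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)  # (heads, tokens, tokens)
weights = softmax(scores)                            # each row sums to 1

# 3) Weighted sum of values per head, then concatenate heads and project.
heads = weights @ V                                  # (heads, tokens, d_head)
out = heads.transpose(1, 0, 2).reshape(n_tokens, d_model) @ W_o
print(out.shape)  # (17, 32)
```

Each row of `weights` is one token's attention distribution over all tokens, which is what the heatmaps in the visualization show.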

See also: Vision Models in Action for a full ViT forward pass on MNIST.

Input

Or try a sample:
Drawing is downsampled to 32×32, then split into a 4×4 grid of 8×8 patches (16 patches total).
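The patch split described above can be sketched as a pair of reshapes, assuming a 32×32 grayscale image and 8×8 patches (the array names here are illustrative, not the demo's code):

```python
import numpy as np

# Assumed 32x32 grayscale input; values are a placeholder ramp.
img = np.arange(32 * 32, dtype=np.float32).reshape(32, 32)
p = 8  # patch size

# Split rows and columns into 4 blocks each, group the blocks together,
# then flatten every 8x8 patch into a 64-vector: (16, 64).
patches = img.reshape(4, p, 4, p).transpose(0, 2, 1, 3).reshape(16, p * p)
print(patches.shape)  # (16, 64)
```

Patch 0 is the top-left 8×8 block, and patches are ordered row by row across the 4×4 grid.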

Visualization

Draw or pick a sample, then click Run

Welcome

Draw something on the canvas (or pick a sample pattern), then click Run Attention to step through multi-head self-attention.

Choose between Random Weights (explore the mechanics) and Pretrained MNIST (see real learned attention patterns on handwritten digits). Adjust patch size, embedding dimension, and number of heads in the Architecture panel.

Architecture

Display

Focus token (for attention steps):
Head visibility:
Regenerates the random projection weights and reruns from step 0.

Color Legend

CLS token
Query (Q)
Key (K)
Value (V)
Head highlight
Heatmaps use the Viridis colormap (dark purple → yellow).