Multi-Head Attention Walkthrough

Step through multi-head self-attention on image patches to see exactly how queries, keys, and values are computed, how attention scores become weights, and how multiple heads combine to produce the output. It uses tiny random weights and runs entirely in the browser.
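To make the walkthrough concrete, here is a minimal sketch in plain TypeScript of the computation the visualization steps through. The function and weight names (multiHeadAttention, wq, wk, wv, wo) are illustrative, not the demo's actual identifiers.

```typescript
// Minimal multi-head self-attention sketch (plain TypeScript, no deps).
// Shapes follow the demo: T tokens of dimension D, H heads, headDim = D / H.
type Mat = number[][]; // row-major [rows][cols]

function matmul(a: Mat, b: Mat): Mat {
  return a.map(row =>
    b[0].map((_, j) => row.reduce((s, v, k) => s + v * b[k][j], 0))
  );
}

function softmaxRow(row: number[]): number[] {
  const m = Math.max(...row);               // subtract max for stability
  const exps = row.map(v => Math.exp(v - m));
  const sum = exps.reduce((s, v) => s + v, 0);
  return exps.map(v => v / sum);
}

// x: [T][D] token embeddings; wq/wk/wv/wo: [D][D] projection matrices.
function multiHeadAttention(x: Mat, wq: Mat, wk: Mat, wv: Mat, wo: Mat, H: number): Mat {
  const T = x.length, D = x[0].length, hd = D / H;
  const Q = matmul(x, wq), K = matmul(x, wk), V = matmul(x, wv);

  // Per head: slice out its columns, score = Q·Kᵀ / sqrt(headDim),
  // softmax the scores into weights, then take a weighted sum of values.
  const headOut: Mat = x.map(() => []);
  for (let h = 0; h < H; h++) {
    const slice = (m: Mat) => m.map(r => r.slice(h * hd, (h + 1) * hd));
    const Qh = slice(Q), Kh = slice(K), Vh = slice(V);
    for (let i = 0; i < T; i++) {
      const scores = Kh.map(kRow =>
        Qh[i].reduce((s, v, d) => s + v * kRow[d], 0) / Math.sqrt(hd)
      );
      const weights = softmaxRow(scores);   // attention weights for token i
      for (let d = 0; d < hd; d++) {
        headOut[i].push(weights.reduce((s, w, j) => s + w * Vh[j][d], 0));
      }
    }
  }
  return matmul(headOut, wo); // concatenated heads -> output projection
}
```

Each row of the returned matrix is one token's output; the intermediate Q, K, V, score, and weight matrices this function computes in passing are exactly what the visualization displays step by step.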

See also: Vision Models in Action for a full ViT forward pass on MNIST.

Input

Draw on the canvas, or pick one of the sample patterns.
Drawing is downsampled to 12×12, then split into a 3×3 grid of 4×4 patches (9 patches total).
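A hedged sketch of that patch split (the demo's exact downsampling method isn't specified here, so assume the drawing has already been reduced to a 12×12 grayscale grid):

```typescript
// img: [12][12] values in [0, 1]; returns 9 patches, each flattened to 16 numbers.
// Names are illustrative, not the demo's identifiers.
function patchify(img: number[][], patch = 4): number[][] {
  const grid = img.length / patch;          // 12 / 4 = 3
  const patches: number[][] = [];
  for (let py = 0; py < grid; py++) {
    for (let px = 0; px < grid; px++) {
      const flat: number[] = [];
      for (let y = 0; y < patch; y++) {
        for (let x = 0; x < patch; x++) {
          flat.push(img[py * patch + y][px * patch + x]);
        }
      }
      patches.push(flat);                   // patches in row-major order
    }
  }
  return patches;                           // 9 patches × 16 values
}
```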

Visualization

Draw or pick a sample, then click Run

Welcome

Draw something on the canvas (or pick a sample pattern), then click Run Attention to step through multi-head self-attention.

We use a tiny setup with a fixed 12×12 input image, configurable patch size, embedding dimension, and number of heads. Adjust these in the Architecture panel on the right. All matrices are small enough to inspect cell by cell.

Architecture

Image: 12×12 — 3×3 grid of 4×4 patches (9 patches) — D=8, 2 heads (head_dim=4) — 10 tokens
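Every number in that summary line follows from three knobs. A small sketch of the arithmetic, with illustrative names:

```typescript
const imageSize = 12, patchSize = 4, D = 8, heads = 2;
const gridSide = imageSize / patchSize;     // 3 patches per side
const numPatches = gridSide * gridSide;     // 3 × 3 = 9 patches
const numTokens = numPatches + 1;           // +1 for the CLS token -> 10 tokens
const headDim = D / heads;                  // 8 / 2 = 4
```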

Display

Focus token (for attention steps):
Head visibility:
Regenerates random projection weights and reruns from step 0.
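A sketch of what that reseed might look like, assuming a small seeded generator for reproducibility (the demo's actual RNG and initialization scale are not specified):

```typescript
// Tiny seeded LCG (numerical-recipes constants) so reruns are reproducible.
function makeRng(seed: number): () => number {
  let s = seed >>> 0;
  return () => {
    s = (s * 1664525 + 1013904223) >>> 0;
    return s / 2 ** 32;                     // uniform in [0, 1)
  };
}

// Fill a rows×cols matrix with small uniform values (assumed scale).
function randomWeights(rows: number, cols: number, rng: () => number): number[][] {
  return Array.from({ length: rows }, () =>
    Array.from({ length: cols }, () => (rng() - 0.5) * 0.5) // [-0.25, 0.25)
  );
}

// Reseeding builds fresh wq/wk/wv/wo this way and replays the walkthrough
// from step 0.
```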

Color Legend

CLS token
Query (Q)
Key (K)
Value (V)
Head highlight
Heatmaps use the Viridis colormap (dark purple → yellow).
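For reference, a rough approximation of that mapping: the snippet below linearly interpolates three Viridis anchor colors rather than evaluating the colormap's true polynomial fit, so it is an illustration, not the exact palette.

```typescript
// Map t in [0, 1] to an approximate Viridis RGB color by interpolating
// between the endpoints (#440154, #fde725) and a mid teal (~#21918c).
function viridisApprox(t: number): [number, number, number] {
  const stops: [number, number, number][] = [
    [0x44, 0x01, 0x54],  // dark purple (t = 0)
    [0x21, 0x91, 0x8c],  // teal        (t = 0.5)
    [0xfd, 0xe7, 0x25],  // yellow      (t = 1)
  ];
  const x = Math.min(Math.max(t, 0), 1) * (stops.length - 1);
  const i = Math.min(Math.floor(x), stops.length - 2);
  const f = x - i;
  return [0, 1, 2].map(c =>
    Math.round(stops[i][c] + f * (stops[i + 1][c] - stops[i][c]))
  ) as [number, number, number];
}
```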