Step through multi-head self-attention on image patches to see exactly how queries, keys, and values are computed, how attention scores become weights, and how multiple heads combine to produce the output. The demo uses tiny random weights and runs entirely in the browser.
See also: Vision Models in Action for a full ViT forward pass on MNIST.
Draw something on the canvas (or pick a sample pattern), then click Run Attention to step through multi-head self-attention.
We use a tiny setup with a fixed 12×12 input image, configurable patch size, embedding dimension, and number of heads. Adjust these in the Architecture panel on the right. All matrices are small enough to inspect cell by cell.
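The computation the demo steps through can be sketched in a few lines of NumPy. This is a minimal illustration, not the demo's actual code: the dimensions below (patch size 4, embedding dimension 8, 2 heads) are assumed example values, and the weights are random, just as in the demo.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example dimensions; the demo's configurable defaults may differ.
IMG, PATCH, D_MODEL, N_HEADS = 12, 4, 8, 2
N_PATCHES = (IMG // PATCH) ** 2   # 3x3 grid -> 9 patches
D_HEAD = D_MODEL // N_HEADS       # 4 dimensions per head

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o):
    """x: (n_patches, d_model) patch embeddings."""
    out_heads = []
    for h in range(N_HEADS):
        sl = slice(h * D_HEAD, (h + 1) * D_HEAD)
        # Each head projects to its own query/key/value subspace.
        q, k, v = x @ w_q[:, sl], x @ w_k[:, sl], x @ w_v[:, sl]
        scores = q @ k.T / np.sqrt(D_HEAD)   # (n, n) raw attention scores
        weights = softmax(scores, axis=-1)   # each row sums to 1
        out_heads.append(weights @ v)        # weighted sum of values
    # Concatenate head outputs, then mix them with the output projection.
    return np.concatenate(out_heads, axis=-1) @ w_o

# Cut the 12x12 image into 4x4 patches, flatten each, project to d_model.
image = rng.standard_normal((IMG, IMG))
grid = IMG // PATCH
patches = (image.reshape(grid, PATCH, grid, PATCH)
                .transpose(0, 2, 1, 3)
                .reshape(N_PATCHES, PATCH * PATCH))
w_embed = rng.standard_normal((PATCH * PATCH, D_MODEL)) * 0.1
x = patches @ w_embed

w_q, w_k, w_v, w_o = (rng.standard_normal((D_MODEL, D_MODEL)) * 0.1
                      for _ in range(4))
out = multi_head_self_attention(x, w_q, w_k, w_v, w_o)
print(out.shape)  # one d_model-sized output vector per patch
```

The intermediate matrices here (`scores`, `weights`, the per-head outputs) correspond to the matrices the demo lets you inspect cell by cell.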