Vision Models in Action

Draw a digit (0–9) on the canvas and hit Run to watch a neural network classify it in real time. The animation shows the actual intermediate feature maps as the image passes through each layer. Hover over any layer in the visualization to inspect individual channel activations and filter weights. Use the tabs to switch between a CNN and a Vision Transformer.

Both models are trained on MNIST. The CNN (~6.2k params) has three 3×3 conv layers, max pooling, global average pooling, and a linear classifier; trained for 25 epochs, it reaches 96.3% test accuracy. The ViT (~19.7k params) splits the 28×28 image into sixteen 7×7 patches and processes them with two transformer layers using multi-head self-attention; trained for 30 epochs, it reaches 97.2% test accuracy. Both were trained with the Adam optimizer on a laptop CPU. All weights are hardcoded directly into the page source, and inference runs in pure JavaScript.
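To make the two architectures concrete, here is a minimal JavaScript sketch (not the page's actual code; function names and shapes are illustrative) of the two core operations: a single 3×3 convolution with ReLU, as used in the CNN's conv layers, and the 7×7 patch split that feeds the ViT.

```javascript
// One 3x3 "same" convolution over a single-channel image, then ReLU.
// img is a flat Float32Array of h*w pixels; kernel is 9 weights, row-major.
function conv3x3(img, h, w, kernel, bias) {
  const out = new Float32Array(h * w);
  for (let y = 0; y < h; y++) {
    for (let x = 0; x < w; x++) {
      let sum = bias;
      for (let ky = -1; ky <= 1; ky++) {
        for (let kx = -1; kx <= 1; kx++) {
          const iy = y + ky, ix = x + kx;
          // Out-of-bounds taps are treated as zero padding.
          if (iy >= 0 && iy < h && ix >= 0 && ix < w) {
            sum += img[iy * w + ix] * kernel[(ky + 1) * 3 + (kx + 1)];
          }
        }
      }
      out[y * w + x] = Math.max(0, sum); // ReLU
    }
  }
  return out;
}

// Split a size x size image into non-overlapping patch x patch tiles,
// each flattened to a vector. For 28x28 with 7x7 patches: 16 patches of 49.
function extractPatches(img, size, patch) {
  const n = size / patch;
  const patches = [];
  for (let py = 0; py < n; py++) {
    for (let px = 0; px < n; px++) {
      const p = new Float32Array(patch * patch);
      for (let y = 0; y < patch; y++) {
        for (let x = 0; x < patch; x++) {
          p[y * patch + x] = img[(py * patch + y) * size + (px * patch + x)];
        }
      }
      patches.push(p);
    }
  }
  return patches;
}
```

Each patch vector would then be linearly projected to an embedding and passed through the transformer layers; the real weights live in the page source.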

Challenge: Can you break it, and can you figure out which part broke?
(Hint: some of the sample images are incorrectly classified.)

Draw a Digit

Or try a sample:

Network Visualization

Hover over a layer to inspect individual channel activations and filter weights.

Prediction

Draw a digit and hit Run

Architecture

Trained on MNIST (60k handwritten digit images). The network has ~6.2k parameters.