In this article I show how to reproduce the Top-1 and Top-5 accuracy results reported for the pre-trained models provided in the popular Keras Python library. The accompanying code can be found on GitHub.

“ImageNet” validation results are frequently reported or referenced in the computer vision literature but, unlike results on other benchmark datasets, it is not immediately obvious how to calculate them from scratch. Validation results on ImageNet in particular are useful, even if you aren’t interested in training new models on the full training set, because of the larger image sizes and the difficulty of the 1000-way classification task. For example, if you are developing adversarial attacks you might want to take pre-trained ImageNet models, perturb images from the validation set, and measure the models’ performance relative to published baselines.

## Steps

“ImageNet” validation results on object classification tasks are usually calculated with the ILSVRC2012 validation set. These validation results include those reported for the pre-trained models from the Keras library. The validation dataset is 6.74 GB and can be downloaded slowly from the ImageNet website or quickly from Academic Torrents. The ILSVRC2012_img_val.tar file contains 50,000 JPEG image files, numbered from 1: “ILSVRC2012_val_00000001.JPEG”, …, “ILSVRC2012_val_00050000.JPEG”.

The ILSVRC2012 development toolkit for Tasks 1 and 2 is also necessary to reproduce validation results and can also be downloaded from the ImageNet website (the “Development kit (Task 1 & 2)”). The files we need from this include:

• readme.txt - useful information about ImageNet concepts (synsets, WordNet, etc.) and how the data is structured.
• data/ILSVRC2012_validation_ground_truth.txt - a text file with 50,000 lines, one for each validation image, where the first line (index 0) contains the ILSVRC2012_ID label for “ILSVRC2012_val_00000001.JPEG”, the second line (index 1) contains the label for “ILSVRC2012_val_00000002.JPEG”, etc. Each ILSVRC2012_ID label is an integer from 1 to 1000.
• data/meta.mat - a Matlab file that contains a mapping between the 1000 ILSVRC2012_ID labels, their synset IDs, and what the classes represent (e.g. “dog”).
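As a rough sketch of how these two files could be read, the helpers below pair each validation file name with its ILSVRC2012_ID and pull the ID-to-synset mapping out of meta.mat. The function names are hypothetical, and `load_synsets` assumes that meta.mat stores a struct array called `synsets` with the fields `ILSVRC2012_ID`, `WNID`, and `words`, and that SciPy is installed; check the devkit readme before relying on it.

```python
from pathlib import Path


def validation_filename(index):
    """Return the file name of the validation image with 1-based `index`."""
    return f"ILSVRC2012_val_{index:08d}.JPEG"


def load_ground_truth(path):
    """Read one integer ILSVRC2012_ID per line.

    Returns {file name: ILSVRC2012_ID}; the first line of the file
    belongs to ILSVRC2012_val_00000001.JPEG.
    """
    labels = Path(path).read_text().split()
    return {validation_filename(i): int(label)
            for i, label in enumerate(labels, start=1)}


def load_synsets(meta_path):
    """Return {ILSVRC2012_ID: (WNID, words)} for IDs 1..1000.

    Assumption: meta.mat holds a struct array named "synsets" with
    fields ILSVRC2012_ID, WNID and words.
    """
    from scipy.io import loadmat  # lazy import: only needed for meta.mat

    synsets = loadmat(meta_path, squeeze_me=True,
                      struct_as_record=False)["synsets"]
    return {int(s.ILSVRC2012_ID): (str(s.WNID), str(s.words))
            for s in synsets
            if int(s.ILSVRC2012_ID) <= 1000}
```

Calling `load_ground_truth("data/ILSVRC2012_validation_ground_truth.txt")` from inside the extracted devkit directory should then give a dictionary with 50,000 entries.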

### Keras/Caffe labels are different

The ordering of classes predicted by the ImageNet pre-trained models from Keras does not directly align with the ILSVRC2012_ID labeling. For example, when a Keras model predicts class “0”, it corresponds to synset “n01440764”, which is “tench, Tinca tinca”, whereas the first ILSVRC2012_ID, “1”, corresponds to synset “n02119789”, which is “kit fox, Vulpes macrotis”.

This discrepancy is due to a design choice in the Caffe library, where the synset IDs (e.g. “n01440764”) were sorted alphabetically and labeled from 0 to 999 instead of being used as is from the ILSVRC devkit (for more information see here). This mapping, between the class numbers predicted by the Keras pretrained models and the synset/class names, can be found in the “synset_words.txt” file from the Caffe GitHub repository. synset_words.txt contains 1,000 lines, where line 0 contains the synset/class name corresponding to “class 0” predicted by the Keras models, line 1 contains the mapping for “class 1”, etc.
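One way to bridge the two labelings is to go through the synset ID, which both files share. The sketch below is illustrative only: the function names are made up, and it assumes each line of synset_words.txt has the form `<synset ID> <class name>`, with the ID-to-synset dictionary coming from meta.mat.

```python
def parse_synset_words(lines):
    """Parse synset_words.txt lines into a list of (WNID, name).

    The list index of each entry is the class index predicted by Keras.
    """
    entries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # Split on the first space only: class names contain spaces.
        wnid, _, name = line.partition(" ")
        entries.append((wnid, name))
    return entries


def ilsvrc_id_to_keras_index(wnid_by_ilsvrc_id, keras_wnids):
    """Map each ILSVRC2012_ID (1..1000) to a Keras class index (0..999).

    wnid_by_ilsvrc_id: {ILSVRC2012_ID: WNID}, e.g. built from meta.mat
    keras_wnids: WNIDs in Keras order, e.g. from synset_words.txt
    """
    index_of = {wnid: i for i, wnid in enumerate(keras_wnids)}
    return {ilsvrc_id: index_of[wnid]
            for ilsvrc_id, wnid in wnid_by_ilsvrc_id.items()}
```

With the full files loaded, `ilsvrc_id_to_keras_index` can be applied to every entry of the ground truth file so that the labels and the model outputs use the same 0 to 999 numbering.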

### Preprocessing validation set images

The images in the ImageNet validation set come in a wide variety of sizes and must be resized to 224x224 in a specific way in order to reproduce the Keras benchmark results. For every image in the validation set we need to apply the following process:

1. Load the image data in a floating point format.
2. Resize the shorter side of the image to 256 pixels using bicubic interpolation over a 4x4 pixel neighborhood (using OpenCV’s resize method with the “INTER_CUBIC” interpolation flag). The longer side should be resized to maintain the original aspect ratio of the image.
3. Crop the central 224x224 window from the resized image.
4. Save the image in RGB format.
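The four steps above can be sketched roughly as follows. This is a minimal illustration under a few assumptions, not necessarily the exact code behind the published numbers: the helper names are hypothetical, the longer side is rounded to the nearest pixel, and the cropped array is returned rather than written to disk. OpenCV is imported inside the function so that the two geometry helpers can be used on their own.

```python
def resized_shape(height, width, short_side=256):
    """Return (new_height, new_width) with the shorter side at `short_side`."""
    scale = short_side / min(height, width)
    # max() guards against rounding the shorter side below `short_side`.
    return (max(short_side, round(height * scale)),
            max(short_side, round(width * scale)))


def center_crop_box(height, width, crop=224):
    """Return (top, left) of the central crop x crop window."""
    return (height - crop) // 2, (width - crop) // 2


def load_and_preprocess(path, short_side=256, crop=224):
    """Load an image as float32 RGB, resize the shorter side, center-crop."""
    import cv2  # lazy import: requires the opencv-python package

    bgr = cv2.imread(str(path), cv2.IMREAD_COLOR)  # OpenCV loads BGR
    if bgr is None:
        raise FileNotFoundError(path)
    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB).astype("float32")
    new_h, new_w = resized_shape(rgb.shape[0], rgb.shape[1], short_side)
    # cv2.resize takes the target size as (width, height).
    resized = cv2.resize(rgb, (new_w, new_h), interpolation=cv2.INTER_CUBIC)
    top, left = center_crop_box(new_h, new_w, crop)
    return resized[top:top + crop, left:left + crop]
```

For a 500x375 (width x height) image, for instance, `resized_shape(375, 500)` gives a 256-pixel-high, 341-pixel-wide target, from which the central 224x224 window is taken.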

### Applying Keras pretrained models

The different models under keras.applications don’t all apply the same preprocessing steps. However, each model has a preprocess_input method that expects floating point RGB images, like the ones produced above. As an example of these differences, the keras.applications.vgg19.preprocess_input method will convert the input to BGR format and subtract 103.939 from the blue channel, 116.779 from the green channel, and 123.68 from the red channel, while the keras.applications.mobilenet.preprocess_input method will divide all channels by 127.5 and then subtract 1. These differences stem from which framework each model was originally trained in. With the preprocessed images from the previous step we can load a Keras model, call its associated preprocess_input method on all the images, and finally calculate the Top-1 and Top-5 accuracy between the model’s predictions and the ground truth from data/ILSVRC2012_validation_ground_truth.txt (with the necessary label conversions).
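To make the final step concrete, here is a small sketch of a Top-k accuracy calculation together with one possible way of wiring it to VGG19. Both function names are hypothetical. `evaluate_vgg19` assumes that Keras is installed, that the ImageNet weights can be downloaded, that `images` is a float RGB NumPy array of shape (N, 224, 224, 3), and that the labels have already been converted to Keras class indices.

```python
def top_k_accuracy(scores, true_classes, k=5):
    """Fraction of rows whose true class is among the k highest scores.

    scores: sequence of per-class score rows (lists or NumPy arrays)
    true_classes: true class index for each row, in Keras numbering
    """
    hits = 0
    for row, truth in zip(scores, true_classes):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        hits += truth in ranked[:k]
    return hits / len(true_classes)


def evaluate_vgg19(images, true_classes):
    """Return (top-1, top-5) accuracy of VGG19 on preprocessed crops."""
    from keras.applications import vgg19  # lazy import: needs Keras

    model = vgg19.VGG19(weights="imagenet")
    # Copy first: preprocess_input may modify a NumPy array in place.
    scores = model.predict(vgg19.preprocess_input(images.copy()))
    return (top_k_accuracy(scores, true_classes, k=1),
            top_k_accuracy(scores, true_classes, k=5))
```

Swapping in another model should only require changing the module, the model constructor, and the matching preprocess_input call; running on all 50,000 images at once may not fit in memory, so batching the predictions is likely to be necessary in practice.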

### Code

All the steps above are shown in the notebooks from the accompanying GitHub repository: