It’s commonly assumed that models pretrained to achieve high performance on ImageNet will perform better on other visual tasks after fine-tuning. But is that always true? A new study reached surprising conclusions.

What’s new: Alexander Ke, William Ellsworth, Oishi Banerjee, and colleagues at Stanford systematically tested various models that were pretrained on ImageNet and fine-tuned to read X-rays. They found that accuracy on ImageNet did not correlate with performance on the fine-tuned tasks. The team also included Andrew Ng and Pranav Rajpurkar, instructor of DeepLearning.AI’s AI for Medicine Specialization.

Key insight: Previous work found that accuracy on ImageNet prior to fine-tuning correlated strongly with accuracy on some vision tasks afterward. But ImageNet images differ from X-rays, and model architecture also influences results — so knowledge gained from ImageNet may not transfer to medical images.

How it works: The authors evaluated the impact of published ImageNet performance, ImageNet pretraining, and parameter count on the fine-tuned performance of six convolutional neural network architectures (including older ones such as ResNet and newer ones such as EfficientNet) in a variety of sizes. They fine-tuned the models to identify six medical conditions using the CheXpert dataset of chest X-ray images. To compensate for potential variations in implementation, they evaluated each model periodically during training, saved checkpoints, and scored an ensemble of the 10 best. They gauged performance via the area under the receiver operating characteristic curve (AUC), which summarizes the tradeoff between true and false positive rates; 1 is a perfect score and 0.5 is chance.

  • To learn whether ImageNet performance correlated with performance on CheXpert, they compared each fine-tuned model’s CheXpert AUC with the pretrained version’s published ImageNet accuracy.
  • To find the impact of ImageNet pretraining, they compared models pretrained on ImageNet with randomly initialized versions.
  • To learn whether a model’s size correlated with its performance after pretraining and fine-tuning, they compared its parameter count to CheXpert AUC.
  • To probe how much of each architecture mattered, they removed up to four blocks from each model prior to fine-tuning and compared CheXpert performance across different degrees of truncation.
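The AUC metric used throughout these comparisons can be sketched in a few lines: it equals the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The labels and scores below are invented for illustration; this is not the study’s evaluation code.

```python
def auc(labels, scores):
    """Area under the ROC curve: the probability that a randomly chosen
    positive example is scored higher than a randomly chosen negative one
    (ties count as half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical labels (1 = condition present) and model scores.
labels = [1, 0, 1, 0, 0, 1]
scores = [0.9, 0.3, 0.8, 0.4, 0.2, 0.6]
print(auc(labels, scores))  # → 1.0: every positive outranks every negative
```

A model that scores at chance hovers around 0.5 by this measure, which is why differences of a few hundredths of AUC, as in the results below, are meaningful.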

Results: The team found no correlation between ImageNet accuracy and average CheXpert AUC after fine-tuning. Specifically, for pretrained models, the Spearman correlation was 0.082; without pretraining, it was 0.059. However, ImageNet pretraining did boost fine-tuned performance by 0.016 AUC on average. For models without pretraining, architecture influenced performance more than parameter count did. For example, the average AUC of MobileNet varied by only 0.005 across different sizes, while the difference between InceptionV3 and MobileNetV2 was 0.052 in average AUC. Removing one block from a model didn’t hinder performance, but removing more did.
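The Spearman correlation reported here is simply the Pearson correlation of ranks, so a value near zero means the ordering of models by ImageNet accuracy tells you almost nothing about their ordering by CheXpert AUC. A minimal sketch, using made-up (ImageNet accuracy, CheXpert AUC) pairs rather than the study’s data:

```python
def rank(xs):
    """1-based ranks, averaging over ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical (ImageNet accuracy, CheXpert AUC) pairs for illustration.
imagenet_acc = [0.76, 0.77, 0.80, 0.82, 0.84]
chexpert_auc = [0.88, 0.85, 0.89, 0.86, 0.87]
print(round(spearman(imagenet_acc, chexpert_auc), 3))  # → -0.1: rankings barely related
```

A correlation of ±1 would mean the two rankings agree (or reverse) exactly; values near zero, like the study’s 0.082, indicate the rankings are essentially unrelated.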

Why it matters: As researchers strive to improve performance on ImageNet, they may be overfitting to the dataset. Moreover, state-of-the-art ImageNet models are not necessarily ideal for processing domain-specific data.

We’re thinking: Language models have made huge advances through pretraining plus fine-tuning. It would be interesting to see the results of a similar analysis in that domain.

