Who Was That Masked Input? Pretraining Method Improves Computer Vision Performance

Published: Jul 06, 2022

Researchers have shown that it’s possible to train a computer vision model effectively on around 66 percent of the pixels in each training image. New work used 25 percent, saving computation and boosting performance to boot.

What's new: Kaiming He, Xinlei Chen, and colleagues at Facebook developed a pretraining method they call Masked Auto-Encoder (MAE). Given a fixed processing budget, MAE pretrained a larger model three times faster, yielding higher performance with less computation than earlier methods.

Key insight: In a masked training scenario (in which portions of each training example are masked and the model learns to fill in the blanks), the larger the mask, the less computation is required. At the same time, it’s axiomatic that bigger neural networks make for better learning. Combining a very large mask with a very high parameter count should result in better performance with less computation.
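
To make the arithmetic concrete, here is a minimal sketch of how masking shrinks the encoder's workload. The 224x224 image size and 16x16 patch size are assumptions (common Vision Transformer defaults, not figures from this article), and the function names are hypothetical. The cost estimate covers only the pairwise self-attention term, which grows with the square of the token count.

```python
def visible_token_count(image_size: int, patch_size: int, mask_ratio: float) -> int:
    """Number of patches the encoder processes after masking."""
    patches_per_side = image_size // patch_size
    total = patches_per_side ** 2
    return total - int(total * mask_ratio)


def relative_attention_cost(mask_ratio: float) -> float:
    """Pairwise self-attention cost relative to an unmasked input.

    Attention compares every token with every other token, so this
    term scales with the square of the fraction of tokens kept.
    """
    keep = 1.0 - mask_ratio
    return keep ** 2


# Assumed setup: 224x224 images, 16x16 patches (not stated in the article).
total_tokens = visible_token_count(224, 16, 0.0)
visible_tokens = visible_token_count(224, 16, 0.75)
print(total_tokens, visible_tokens, relative_attention_cost(0.75))
# prints: 196 49 0.0625
```

Under these assumptions, a 75 percent mask leaves the encoder a quarter of the tokens, which frees budget for a much larger encoder.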

How it works: A typical autoencoder uses an encoder and decoder to generate representations for use by a different model. During training, the encoder learns to create a representation of the input, and the decoder learns to use the representation to reproduce the input. The authors used transformers for the encoder and decoder, and the encoder’s parameter count was roughly an order of magnitude greater than the decoder’s. They pretrained the autoencoder on ImageNet examples that had been heavily masked, then fine-tuned the encoder’s representations on ImageNet as well.

Following Vision Transformer, the authors divided each training example into patches. They masked 75 percent of patches at random and passed the unmasked patches to the encoder, which produced a representation of each one.
Given the representations, the decoder reconstructed the entire image, including the masked patches.
The loss function encouraged the encoder and decoder to minimize the difference between the reconstructed image and the original.
To fine-tune the representations for ImageNet classification, the authors appended a fully connected layer to the encoder and discarded the decoder.
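
The data flow in the steps above can be sketched in NumPy. This is an illustration under stated assumptions, not the authors' implementation: the 224x224 image size and 16x16 patch size are assumed, mean squared error stands in for the unspecified "difference," the function names are hypothetical, and the transformer encoder and decoder are replaced by a placeholder so the example stays self-contained.

```python
import numpy as np


def patchify(image: np.ndarray, patch_size: int) -> np.ndarray:
    """Split an (H, W, C) image into flattened, non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C),
    ordered row by row across the patch grid.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def random_mask(num_patches: int, mask_ratio: float, rng: np.random.Generator):
    """Shuffle patch indices; keep a prefix visible and mask the rest."""
    num_keep = num_patches - int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)
    return np.sort(perm[:num_keep]), np.sort(perm[num_keep:])


def reconstruction_loss(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error (an assumed choice of distance)."""
    return float(np.mean((pred - target) ** 2))


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))         # stand-in for a training image
patches = patchify(image, 16)             # shape (196, 768)

visible_idx, masked_idx = random_mask(len(patches), 0.75, rng)
encoder_input = patches[visible_idx]      # only these 49 patches reach the encoder

# Placeholder for the decoder's output; a trained model would predict this.
reconstruction = np.zeros_like(patches)
loss = reconstruction_loss(reconstruction, patches)
print(encoder_input.shape, len(masked_idx), loss > 0.0)
```

Because the encoder never sees the masked patches, its cost depends on the 49 visible patches rather than all 196. Only the small decoder handles the full set.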

Results: MAE’s fine-tuned representations achieved 85.9 percent accuracy on ImageNet classification. That beat representations learned from scratch using the same architecture (82.6 percent) as well as BEiT (85.2 percent), an earlier masked training method that used less masking, a smaller encoder, and a different random masking strategy. MAE trained 3.7 times faster than the same architecture without masking and up to 3.5 times faster than BEiT.

Why it matters: Given a larger model, providing less information at input is not necessarily a disadvantage. Rather, it can improve both computational efficiency and performance.

We're thinking: Would a similar design that pairs heavy masking and a plus-sized encoder boost training efficiency in large language models?
