Pretraining methods generate basic representations for later fine-tuning, but they’re prone to certain issues that can throw them off-kilter. New work proposes a solution.
What’s new: Researchers at Facebook, PSL Research University, and New York University led by Adrien Bardes devised an unsupervised pretraining method they call Variance-Invariance-Covariance Regularization (VICReg). VICReg helps a model learn useful representations based on well understood statistical principles.
Key insight: Pretraining methods can suffer from three common failings: Generating an identical representation for different input examples (which leads to predicting the mean consistently in linear regression), generating dissimilar representations for examples that humans find similar (for instance, the same object viewed from two angles), and generating redundant parts of a representation (say, multiple vectors that represent two eyes in a photo of a face). Statistically speaking, these problems boil down to issues of variance, invariance, and covariance respectively.
How it works: VICReg manages variance, invariance, and covariance via different terms in a loss function. The authors used it to pretrain a ResNet-50 on ImageNet without labels.
- To discourage similar representations of every example, the variance term of VICReg’s loss function computes the variance within an input batch’s representations; that is, the average amount by which each value differs from the mean. This term penalizes the model if this variance falls below a threshold.
- The covariance term computes correlations between elements of each representation. It sums the correlations and penalizes the model for extracting correlated features within a given representation.
- To prevent dissimilar representations of similar examples, VICReg borrows an idea from contrastive learning: It uses data augmentation. Two different, random augmentations are applied to each example, and the model processes them separately to generate two different, but related, representations. The invariance term computes the distance between them. The greater the distance, the greater the penalty.
Results: The authors transferred the VICReg-trained ResNet-50’s representations to a linear classifier and trained it on ImageNet with labels. That model achieved a 73.2 percent accuracy, just shy of the 76.5 percent achieved by a supervised ResNet-50. A linear classifier using representations from a ResNet-50 pretrained using the contrastive learning method SimCLR achieved 69.3 percent accuracy.
Why it matters: Contrastive learning, a successful pretraining technique, requires a large number of comparisons between dissimilar inputs to ensure that not all representations are identical. VICReg avoids that issue by computing the variance within a batch, a much less memory-intensive operation.
We’re thinking: Comparing different augmentations of the same example has proven to be a powerful way to learn. This technique extends that approach, and we expect to see more.