Vision Transformer

12 Posts

Synthetic Data Helps Image Classification: StableRep, a method that trains vision transformers on images generated by Stable Diffusion

Generated images can be more effective than real ones in training a vision model to classify images. Yonglong Tian, Lijie Fan, and colleagues at Google and MIT introduced StableRep, a self-supervised method that trains vision transformers on images generated by Stable Diffusion.
Masked Pretraining for CNNs: ConvNeXt V2, the new model family that boosts ConvNet performance

Vision transformers have bested convolutional neural networks (CNNs) in a number of key vision tasks. Have CNNs hit their limit? New research suggests otherwise.
Vision Transformers Made Manageable: FlexiViT, the vision transformer that allows users to specify the patch size

Vision transformers typically process images in patches of fixed size. Smaller patches yield higher accuracy but require more computation. A new training method lets AI engineers adjust the tradeoff.
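
To make that tradeoff concrete, the back-of-the-envelope sketch below shows how patch size drives a vision transformer's token count and, in turn, its self-attention cost; the 224-pixel input resolution and the patch sizes are illustrative assumptions, not figures from the article. Halving the patch side quadruples the token count, and attention cost grows roughly with the square of the token count.

```python
# Illustrative sketch (assumed numbers): how patch size affects a vision
# transformer's token count and self-attention cost.
image_size = 224  # pixels per side, a common ViT input resolution

base_tokens = (image_size // 32) ** 2  # token count at the largest patch size below
for patch_size in (32, 16, 8):
    tokens = (image_size // patch_size) ** 2  # one token per non-overlapping patch
    rel_cost = (tokens / base_tokens) ** 2    # self-attention is quadratic in token count
    print(f"patch {patch_size:>2} px -> {tokens:>3} tokens, ~{rel_cost:.0f}x attention cost")
```
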
AI's Eyes Evolve: Vision transformer research exploded in 2022.

Work on vision transformers exploded in 2022. Researchers published well over 17,000 ViT papers during the year. A major theme: combining self-attention and convolution.
Cookbook for Vision Transformers: A Formula for Training Vision Transformers

Vision Transformers (ViTs) are overtaking convolutional neural networks (CNNs) in many vision tasks, but procedures for training them are still tailored to CNNs. New research investigated how various training ingredients affect ViT performance.
Attention to Rows and Columns: Altering Transformers' Self-Attention Mechanism for Greater Efficiency

A new approach alters transformers' self-attention mechanism to balance computational efficiency with performance on vision tasks.
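
The title points to attention restricted to rows and columns. As a rough illustration of that general idea (an axial-style pattern; this is a generic sketch, not necessarily the paper's exact mechanism), the snippet below attends along each row and then along each column of a feature map, so the cost scales with H*W*(H+W) rather than (H*W)^2 as in full self-attention.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axial_attention(feat):
    """Attend along rows, then along columns, instead of over all H*W positions.
    feat: array of shape (H, W, C). Single-head sketch with identity projections
    for queries, keys, and values (an illustrative simplification)."""
    H, W, C = feat.shape
    # Row attention: each position attends to the W positions in its own row.
    scores = np.einsum('hic,hjc->hij', feat, feat) / np.sqrt(C)  # (H, W, W)
    feat = np.einsum('hij,hjc->hic', softmax(scores), feat)
    # Column attention: each position attends to the H positions in its own column.
    scores = np.einsum('iwc,jwc->wij', feat, feat) / np.sqrt(C)  # (W, H, H)
    feat = np.einsum('wij,jwc->iwc', softmax(scores), feat)
    return feat

out = axial_attention(np.random.randn(8, 8, 16))
print(out.shape)  # (8, 8, 16)
```
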
Object-Detection Transformers Simplified: New Research Improves Object Detection With Vision Transformers

ViTDet, a new system from Facebook, adds an object detector to a plain pretrained transformer.
Cutting the Carbon Cost of Training: A New Tool Helps NLP Models Lower Their Greenhouse Gas Emissions

You can reduce your model’s carbon emissions by being choosy about when and where you train it.
Who Was That Masked Input? Pretraining Method Improves Computer Vision Performance

Researchers have shown that it’s possible to train a computer vision model effectively on around 66 percent of the pixels in each training image. New work used 25 percent, saving computation and boosting performance to boot.
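
As a rough illustration of pretraining on a small, random subset of patches (a minimal sketch in the spirit of masked autoencoding; the 196-patch grid and 25 percent keep ratio are assumptions for illustration, not the paper's exact recipe), the snippet below selects the visible patches an encoder would process and marks the rest for reconstruction.

```python
import numpy as np

def random_patch_mask(num_patches: int, keep_ratio: float = 0.25, seed: int = 0):
    """Randomly keep a fraction of patches and mask the rest (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    num_keep = int(num_patches * keep_ratio)
    order = rng.permutation(num_patches)     # random shuffle of patch indices
    keep_idx = np.sort(order[:num_keep])     # patches the encoder actually sees
    mask = np.ones(num_patches, dtype=bool)  # True = masked, to be reconstructed
    mask[keep_idx] = False
    return keep_idx, mask

# A 224x224 image split into 16x16 patches yields 196 patches; with
# keep_ratio=0.25 the encoder processes only ~49 of them.
keep_idx, mask = random_patch_mask(num_patches=196, keep_ratio=0.25)
print(len(keep_idx), "visible patches,", mask.sum(), "masked patches")
```
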
Upgrade for Vision Transformers: Improved Efficiency for Vision Transformers

Vision Transformer and models like it use a lot of computation and memory when processing images. New work modifies these architectures to run more efficiently while adopting helpful properties from convolutions.
Less Data for Vision Transformers: Boosting Vision Transformer Performance with Less Data

Vision Transformer (ViT) outperformed convolutional neural networks in image classification, but it required more training data. New work enabled ViT and its variants to outperform other architectures with less training data.
High Accuracy at Low Power: An energy efficient method for computer vision

Equipment that relies on computer vision while unplugged, such as mobile phones, drones, satellites, and autonomous cars, needs power-efficient models. A new architecture set a record for accuracy per computation.
