Swin

3 Posts

Less Data for Vision Transformers: Boosting Vision Transformer Performance with Less Data

Vision Transformer (ViT) outperformed convolutional neural networks in image classification, but it required more training data. New work enabled ViT and its variants to outperform other architectures with less training data.
High Accuracy at Low Power: An energy-efficient method for computer vision.

Equipment that relies on computer vision while unplugged, such as mobile phones, drones, satellites, and autonomous cars, needs power-efficient models. A new architecture set a record for accuracy per computation.
Transformer Speed-Up Sped Up: How to Speed Up Image Transformers

The transformer architecture is notoriously inefficient when processing long sequences — a problem in processing images, which are essentially long sequences of pixels. One way around this is to break up input images and process the pieces…
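The idea of breaking an image into pieces before attention can be sketched in a few lines. Below is a minimal, hypothetical illustration (not the article's actual method) of splitting an image into non-overlapping patches, the tokenization step ViT-style models apply so attention runs over a few hundred patches instead of tens of thousands of pixels:

```python
import numpy as np

def to_patches(image, patch_size):
    """Split an (H, W, C) image into non-overlapping square patches.

    Returns an array of shape (num_patches, patch_size, patch_size, C).
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    # Reshape so patch rows/columns become separate axes, then group them.
    patches = image.reshape(h // patch_size, patch_size,
                            w // patch_size, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)  # (rows, cols, ph, pw, c)
    return patches.reshape(-1, patch_size, patch_size, c)

# A 224x224 RGB image with 16x16 patches yields 14*14 = 196 tokens —
# attention over 196 tokens instead of 50,176 pixels.
image = np.zeros((224, 224, 3), dtype=np.float32)
patches = to_patches(image, 16)
print(patches.shape)  # (196, 16, 16, 3)
```

Each patch is then flattened and linearly projected into a token embedding before entering the transformer; window-based designs such as Swin further restrict attention to local groups of these patches.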
