Swin

3 Posts

[Image: Shifted Patch Tokenization (SPT) | Locality Self-Attention (LSA)]

Less Data for Vision Transformers: Boosting Vision Transformer Performance with Less Data

Vision Transformer (ViT) outperformed convolutional neural networks in image classification, but it required more training data. New work enabled ViT and its variants to outperform other architectures with less training data.
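The image caption above names the paper's two ingredients: Shifted Patch Tokenization (SPT), which concatenates spatially shifted copies of the input before patch embedding so each token sees a wider receptive field, and Locality Self-Attention (LSA), which uses a learnable softmax temperature and keeps tokens from attending to themselves. Here is a minimal PyTorch sketch of LSA under those assumptions; the class and parameter names are illustrative, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalitySelfAttention(nn.Module):
    """Sketch of LSA: learnable temperature plus diagonal masking."""

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // num_heads
        # LSA replaces the fixed 1/sqrt(d) scaling with a learnable temperature.
        self.temperature = nn.Parameter(torch.tensor(head_dim ** -0.5))
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, d // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.temperature
        # Diagonal masking: no token attends to itself, which pushes
        # attention toward other patches (the "locality" in LSA).
        mask = torch.eye(n, dtype=torch.bool, device=x.device)
        attn = attn.masked_fill(mask, float("-inf"))
        out = F.softmax(attn, dim=-1) @ v            # (b, heads, n, head_dim)
        out = out.transpose(1, 2).reshape(b, n, d)
        return self.proj(out)
```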
[Image: Overview of Mobile-Former | Cross attention over the entire feature map for the first token in Mobile→Former]

High Accuracy at Low Power: An energy efficient method for computer vision

Equipment that relies on computer vision while unplugged (mobile phones, drones, satellites, autonomous cars) needs power-efficient models. A new architecture set a record for accuracy per computation.
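The caption above describes cross attention in the Mobile→Former direction: a small set of learnable global tokens queries every position of the convolutional feature map. The single-head PyTorch sketch below illustrates that step under stated assumptions (the module name, token count, and scaling are ours, not the paper's exact design); in Mobile-Former itself, the two branches also exchange information in the reverse Former→Mobile direction at every block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MobileToFormer(nn.Module):
    """Global tokens (queries) cross-attend over a conv feature map (keys/values)."""

    def __init__(self, channels, num_tokens=6):
        super().__init__()
        # A handful of learnable global tokens, shared across the batch.
        self.tokens = nn.Parameter(torch.randn(num_tokens, channels))
        self.to_q = nn.Linear(channels, channels)

    def forward(self, feature_map):
        # feature_map: (b, c, h, w) from the MobileNet-style branch.
        b, c, h, w = feature_map.shape
        kv = feature_map.flatten(2).transpose(1, 2)                 # (b, h*w, c)
        q = self.to_q(self.tokens).unsqueeze(0).expand(b, -1, -1)  # (b, tokens, c)
        # Each token attends over every spatial position of the feature map.
        attn = F.softmax(q @ kv.transpose(1, 2) / c ** 0.5, dim=-1)  # (b, tokens, h*w)
        return self.tokens + attn @ kv                              # (b, tokens, c)
```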
[Image: Animation of a transformer architecture processing an image]

Transformer Speed-Up Sped Up: How to Speed Up Image Transformers

The transformer architecture is notoriously inefficient on long sequences, a problem for images, which are essentially long sequences of pixels. One way around this is to break up input images and process the pieces separately.
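As a concrete picture of breaking up the image and processing the pieces, the sketch below partitions a feature map into non-overlapping windows and runs self-attention within each window, so the cost grows with the number of windows rather than with the square of the full sequence length. The window size, head count, and use of PyTorch's built-in attention layer are assumptions for illustration.

```python
import torch
import torch.nn as nn

class WindowAttention(nn.Module):
    """Self-attention restricted to non-overlapping windows of a feature map."""

    def __init__(self, channels, window=7, num_heads=4):
        super().__init__()
        self.window = window
        # channels must be divisible by num_heads.
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x):
        # x: (b, h, w, c), with h and w divisible by the window size.
        b, h, w, c = x.shape
        s = self.window
        # Partition into s x s tiles; each tile becomes one short
        # attention sequence of s*s tokens.
        tiles = x.view(b, h // s, s, w // s, s, c)
        tiles = tiles.permute(0, 1, 3, 2, 4, 5).reshape(-1, s * s, c)
        out, _ = self.attn(tiles, tiles, tiles)  # attention within each window only
        # Undo the partition back to (b, h, w, c).
        out = out.view(b, h // s, w // s, s, s, c)
        return out.permute(0, 1, 3, 2, 4, 5).reshape(b, h, w, c)
```

Swin-style models also shift the window grid between successive layers so information can flow across window boundaries.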
