Vision Models Get Some Attention
Researchers add self-attention to convolutional neural nets.

Published

Mar 31, 2021

Self-attention is a key element in state-of-the-art language models, but it struggles to process images because its memory requirement grows quadratically with the number of input elements, and an image contains many pixels. New research addresses the issue with a simple twist on a convolutional neural network.

What’s new: Aravind Srinivas and colleagues at UC Berkeley and Google introduced BoTNet, a convolutional architecture that uses self-attention to improve average precision in object detection and segmentation.

Key insight: Self-attention and convolution have complementary strengths. Self-attention layers enable a model to find relationships between different areas of an image, while convolutional layers help the model to capture details. Self-attention layers work best when inputs are small, while convolutional layers can shrink input size. Combining the two offers the best of both worlds.
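To make the point about input size concrete, here is a rough back-of-the-envelope sketch. It assumes standard self-attention, in which every position attends to every other position, so one attention head stores one score per pair of positions. The function name and the feature-map sizes are hypothetical, chosen only for illustration.

```python
def attention_matrix_entries(height: int, width: int) -> int:
    """Count the pairwise attention scores one head computes over a feature map.

    Assumes every position attends to every other position (standard
    self-attention), so the count is (height * width) squared.
    """
    positions = height * width
    return positions * positions


# Hypothetical feature-map sizes, chosen only for illustration.
for side in (64, 32, 16):
    entries = attention_matrix_entries(side, side)
    print(f"{side}x{side} map: {entries:,} attention scores per head")
```

Under this assumption, halving each side of the feature map cuts the number of attention scores by a factor of 16, which is why applying attention after convolutional layers have shrunk the input is far cheaper than applying it to raw pixels.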

How it works: BoTNet-50 is a modified ResNet-50. The authors trained it on COCO’s object detection and segmentation tasks (that is, to draw bounding boxes around objects and determine which object each pixel belongs to) using Mask R-CNN, a framework that specifies the network architecture and training procedure for these tasks.

Some ResNets use bottleneck blocks, each of which stacks three convolutional layers. The first layer reduces the number of channels in its input, the second extracts representations, and the third restores the original number of channels.
BoTNet adopts this structure, but in the last three blocks of the network, the authors replaced the second convolutional layer with a self-attention layer.
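The block structure described above can be sketched in NumPy. This is a simplified illustration, not the authors' implementation. It assumes the first and third layers are 1x1 convolutions (written here as matrix multiplications over the channel axis), uses a single attention head, and omits normalization, multiple heads, and any position encoding. All function names, weight names, and shapes are hypothetical.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    scores = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_attention_2d(x, w_q, w_k, w_v):
    """Single-head self-attention over all positions of an (H, W, C) map."""
    h, w, c = x.shape
    tokens = x.reshape(h * w, c)                  # one token per position
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (H*W, H*W) scores
    return (weights @ v).reshape(h, w, -1)


def bottleneck_with_attention(x, w_reduce, w_q, w_k, w_v, w_expand):
    """Bottleneck block whose middle layer is self-attention (a sketch)."""
    relu = lambda t: np.maximum(t, 0.0)
    reduced = relu(x @ w_reduce)                  # 1x1 conv: C -> C_mid
    mixed = relu(self_attention_2d(reduced, w_q, w_k, w_v))  # replaces the middle conv
    expanded = mixed @ w_expand                   # 1x1 conv: C_mid -> C
    return relu(x + expanded)                     # residual connection


# Illustrative shapes only: an 8x8 feature map with 64 channels.
rng = np.random.default_rng(0)
c, c_mid = 64, 16
x = rng.standard_normal((8, 8, c))
w_reduce = rng.standard_normal((c, c_mid)) * 0.1
w_q, w_k, w_v = (rng.standard_normal((c_mid, c_mid)) * 0.1 for _ in range(3))
w_expand = rng.standard_normal((c_mid, c)) * 0.1

y = bottleneck_with_attention(x, w_reduce, w_q, w_k, w_v, w_expand)
print(y.shape)  # (8, 8, 64)
```

Because the attention layer sits between the channel-reducing and channel-restoring layers, it operates on the narrower representation, and the block's input and output shapes stay the same as in a purely convolutional bottleneck.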

Results: BoTNet-50 beat a traditional ResNet-50 in both object detection and segmentation. In segmentation, averaged over all objects in COCO, more than half of the pixels that BoTNet-50 associated with a given object matched the ground-truth labels 62.5 percent of the time, versus 59.6 percent for the ResNet-50. In detection, more than half of BoTNet-50’s predicted bounding box overlapped the ground-truth bounding box 65.3 percent of the time, versus 62.5 percent for the ResNet-50.
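The "more than half overlapped" criterion resembles the intersection-over-union (IoU) threshold of 0.5 that is common in COCO-style evaluation. Assuming that reading, the following sketch shows how a single predicted box could be checked against a ground-truth box. The function name, box format, and coordinates are hypothetical, and this is not the evaluation code used in the research.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any.
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# Made-up boxes, for illustration only.
predicted = (10, 10, 50, 50)
ground_truth = (20, 20, 60, 60)

iou = box_iou(predicted, ground_truth)
is_match = iou > 0.5  # assumed threshold
print(f"IoU = {iou:.3f}, counts as a match: {is_match}")
```

In this made-up case the boxes share a 30x30 region out of a combined area of 2,300, so the IoU is about 0.39 and the prediction would not count as a match at the assumed threshold.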

Why it matters: Good ideas in language processing can benefit computer vision and vice versa.

We’re thinking: Convolution is almost all you need.