Equipment that relies on computer vision while unplugged (mobile phones, drones, satellites, autonomous cars) needs power-efficient models. A new architecture set a record for accuracy per computation.
What's new: Yinpeng Chen and colleagues at Microsoft devised Mobile-Former, an image recognition system that efficiently weds a MobileNet’s convolutional eye for detail with a Vision Transformer’s attention-driven grasp of the big picture.
Key insight: Convolutional neural networks process images in patches, which makes them computationally efficient but ignores global features that span multiple patches. Transformers represent global features, but they’re inefficient. A transformer’s self-attention mechanism compares each part of an input to each other part, so the amount of computation required grows quadratically with the size of the input. Mobile-Former combines the two architectures, but instead of applying self-attention to the image, its transformers compare each part of an input to a small set of learned vectors. This gives the system information about global features without the computational burden.
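To make the scaling argument concrete, here is a rough back-of-the-envelope sketch that counts pairwise comparisons only. The feature-map sizes (14x14, 28x28, 56x56) are illustrative assumptions, not figures from the paper, and the function names are hypothetical. The default of six learned vectors matches the token count given in the description that follows.

```python
def self_attention_pairs(num_positions: int) -> int:
    """Comparisons when every position is compared to every position."""
    return num_positions * num_positions


def token_attention_pairs(num_positions: int, num_tokens: int = 6) -> int:
    """Comparisons when every position is compared to a few learned vectors."""
    return num_positions * num_tokens


# Assumed feature-map side lengths, chosen only for illustration.
for side in (14, 28, 56):
    n = side * side
    print(
        f"{side}x{side}: self-attention {self_attention_pairs(n):>10,} pairs, "
        f"token attention {token_attention_pairs(n):>7,} pairs"
    )
```

Under these assumptions, doubling the side length multiplies the self-attention count by 16 but the token-attention count by only 4, since the latter grows linearly with the number of positions.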
How it works: Mobile-Former is a stack of layers, each made up of three components: a MobileNet block and a transformer block joined by a two-way bridge of two attention layers (one for each direction of communication). The MobileNet blocks refine an image representation, the transformer blocks refine a set of six tokens (randomly initialized vectors that are learned during training), and the bridge further refines the image representation according to the tokens and vice versa. The authors trained the system on ImageNet.
- Given an image, a convolutional layer generates a representation. Given the representation and the tokens, the bridge updates the tokens to represent the image. This starts an iterative process:
- A MobileNet block refines the image representation and passes it to the bridge.
- A transformer block refines the tokens based on the relationships between them and passes them to the bridge.
- The bridge updates the image representation according to the tokens, and the tokens according to the image representation, and passes them all to the next series of blocks.
- The process repeats until, at the end of the line, two fully connected layers render a classification.
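The data flow in the steps above can be sketched in plain Python. This is a toy illustration under loud assumptions, not the authors' implementation: attention has no learned projections, `toy_local_block` is a hypothetical stand-in for a MobileNet block, the image is a flattened sequence of 16 positions with 8 channels, both bridge directions read the pre-bridge values, and the final two fully connected layers are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Dot-product attention: each query becomes a weighted mix of memory rows.

    Simplification (assumption): no learned query/key/value projections.
    """
    dim = len(queries[0])
    mixed = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(dim) for m in memory]
        weights = softmax(scores)
        mixed.append([sum(w * m[j] for w, m in zip(weights, memory)) for j in range(dim)])
    return mixed


def residual(rows, updates):
    """Add an update to each vector, element by element."""
    return [[x + u for x, u in zip(row, upd)] for row, upd in zip(rows, updates)]


def toy_local_block(features):
    """Hypothetical stand-in for a MobileNet block: average each position with its neighbors."""
    n = len(features)
    out = []
    for i in range(n):
        window = features[max(0, i - 1): min(n, i + 2)]
        out.append([sum(col) / len(window) for col in zip(*window)])
    return out


def layer(features, tokens):
    """One layer: local block, token self-attention, then the two-way bridge."""
    features = toy_local_block(features)                     # refine image representation
    tokens = residual(tokens, attend(tokens, tokens))        # tokens attend to one another
    new_features = residual(features, attend(features, tokens))  # bridge: tokens -> image
    new_tokens = residual(tokens, attend(tokens, features))      # bridge: image -> tokens
    return new_features, new_tokens


random.seed(0)
DIM, POSITIONS, NUM_TOKENS = 8, 16, 6  # sizes are illustrative assumptions
features = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(POSITIONS)]
tokens = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]

# Initial bridge step: update the tokens so they represent the image.
tokens = residual(tokens, attend(tokens, features))
for _ in range(3):  # three layers, an arbitrary depth for the sketch
    features, tokens = layer(features, tokens)

print(len(features), len(tokens), len(tokens[0]))
```

Note that the only attention applied across all image positions compares them against six tokens, while the tokens' self-attention involves just six-by-six comparisons, which is where the savings over full self-attention come from.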
Results: Mobile-Former beat competitors at a similar computational budget and at much larger budgets as well. In ImageNet classification, it achieved 77.9 percent accuracy using 294 megaflops (a measure of computational operations), beating transformers that required much more computation. The nearest competitor under 1.5 gigaflops, Swin, scored 77.3 percent using 1 gigaflop. At a comparable budget of 299 megaflops, a variation on the ShuffleNetV2 convolutional network scored 72.6 percent accuracy.
Yes, but: The system is not efficient in terms of the number of parameters and thus memory requirements. Mobile-Former-294M encompasses 11.4 million parameters, while Swin has 7.3 million and ShuffleNetV2 has 3.5 million. One reason: Parameters in the MobileNet blocks, transformer blocks, and bridge aren’t shared.
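For a sense of what those parameter counts mean for memory, the following sketch converts them to megabytes. It assumes each parameter is stored as a 32-bit float (4 bytes), which the article does not state, and it counts weights only, not activations.

```python
BYTES_PER_PARAM = 4  # assumption: 32-bit floating-point weights

# Parameter counts as reported in the text above.
param_counts = {
    "Mobile-Former-294M": 11_400_000,
    "Swin": 7_300_000,
    "ShuffleNetV2": 3_500_000,
}


def weight_megabytes(num_params: int, bytes_per_param: int = BYTES_PER_PARAM) -> float:
    """Approximate storage for the weights alone, in megabytes (10^6 bytes)."""
    return num_params * bytes_per_param / 1_000_000


for name, count in param_counts.items():
    print(f"{name}: about {weight_megabytes(count):.1f} MB of weights")
```

Storing weights at lower precision would shrink these figures proportionally, but the ordering among the three models would stay the same.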
Why it matters: Transformers have strengths that have propelled them into an ever wider range of applications. Integrating them with other architectures makes it possible to take advantage of the strengths of both.
We're thinking: Using more than six tokens didn’t result in better performance. It appears that the need for attention in image tasks is limited — at least for images of 224x224 resolution.