Vision transformers need architectural modifications and retraining from scratch before they can be used for object detection, or so most researchers thought. New work used vision transformers for object detection without the usual redesign and retraining.
What’s new: Yanghao Li and colleagues at Facebook proposed ViTDet, which adds an object detector to a plain pretrained transformer.
Key insight: Vision transformers (ViTs) have rivaled convolutional neural nets (CNNs) in many vision tasks, but not object detection. That’s because a CNN’s hierarchical architecture, in which layers of different sizes produce representations of an image at different scales, helps to spot objects of any size. Consequently, copying this architecture is a natural choice when adapting transformers to detection, and many ViT variants for object detection use a hierarchical backbone (the feature extractor that feeds a detection-specific neck and head). A simpler solution, though, is to add hierarchical layers to the end of a vanilla ViT backbone. This avoids the need to redesign the network, and it enables object detection models to benefit from pretrained ViTs that weren’t developed with that task in mind.
How it works: ViTDet combines a ViT pretrained on ImageNet, which produces a representation of an input image, with Mask R-CNN’s prediction layers, an established component for object detection and image segmentation. The authors fine-tuned the system for those tasks on an augmented version of COCO. They made the following alterations prior to fine-tuning:
- To help the system recognize objects of different scales in the input image, they applied convolutions and deconvolutions to ViT’s representation, producing representations at four scales. For each representation, the Mask R-CNN layers computed object labels, bounding boxes, and segmentation masks.
- To enable the self-attention mechanism to process higher-resolution input, they split its input into non-overlapping windows (each the size of the normal input during pretraining) and restricted self-attention to those windows. To enable information to propagate across the windows, they added four convolutional layers to ViT. To avoid the need to retrain ViT from scratch, they initialized the convolutional layers to act as the identity, so that at the start of fine-tuning each one passes the representation through without modification.
- They augmented the fine-tuning set via large-scale jittering augmentation. This augmentation helps a model learn how objects look at different scales by shrinking images by a random factor and placing them at the top-left corner of an upscaled 1024x1024-pixel canvas.
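The windowing and identity initialization described in the second item above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code: the function and class names (window_partition, window_unpartition, conv3x3, PropagationBlock) are hypothetical, the feature map is a toy (H, W, C) array, and the identity behavior is obtained by assuming a residual block whose last convolution starts at zero, which is one common way to make a new layer a no-op at initialization.

```python
import numpy as np


def window_partition(x, window):
    """Split an (H, W, C) feature map into non-overlapping (window, window, C) tiles."""
    h, w, c = x.shape
    assert h % window == 0 and w % window == 0, "pad the map to a multiple of the window first"
    x = x.reshape(h // window, window, w // window, window, c)
    # Result: (num_windows, window, window, C). Self-attention would run inside each tile.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window, window, c)


def window_unpartition(windows, h, w):
    """Inverse of window_partition: stitch the tiles back into an (H, W, C) map."""
    window, c = windows.shape[1], windows.shape[3]
    x = windows.reshape(h // window, w // window, window, window, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(h, w, c)


def conv3x3(x, weight):
    """Zero-padded 3x3 convolution. x: (H, W, C_in); weight: (3, 3, C_in, C_out)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, weight.shape[3]))
    for dy in range(3):
        for dx in range(3):
            out += padded[dy:dy + h, dx:dx + w] @ weight[dy, dx]
    return out


class PropagationBlock:
    """Residual conv block that mixes information across window borders.

    Assumption: the last convolution is zero-initialized, so the block
    returns its input unchanged until fine-tuning updates the weights.
    """

    def __init__(self, channels, rng):
        self.w1 = rng.normal(scale=0.02, size=(3, 3, channels, channels))
        self.w2 = np.zeros((3, 3, channels, channels))  # zero init: block starts as identity

    def __call__(self, x):
        hidden = np.maximum(conv3x3(x, self.w1), 0.0)  # ReLU
        return x + conv3x3(hidden, self.w2)


# Toy usage: an 8x8 map with 4 channels, split into four 4x4 windows.
features = np.random.default_rng(0).normal(size=(8, 8, 4))
tiles = window_partition(features, 4)
restored = window_unpartition(tiles, 8, 8)
block = PropagationBlock(4, np.random.default_rng(1))
propagated = block(restored)
```

Because the second convolution starts at zero, the block leaves the pretrained representation untouched at the start of fine-tuning, while its 3x3 kernels span window borders and can learn to carry information between neighboring windows.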
Results: A ViTDet based on ViT-Huge performed bounding-box detection with 61.3 average precision (a measure of how many objects were correctly identified in their correct location, higher is better) and instance segmentation with 53.1 average precision. SwinV2-L, based on a transformer with a hierarchical architecture, performed bounding-box detection with 60.2 average precision and instance segmentation with 52.1 average precision.
Why it matters: Decoupling the vision model’s design and training from the object-detection stage is bound to accelerate progress on transformer-based object detection systems. If any pretrained transformer can be used for object detection directly off the shelf, then any improvement in pretrained transformers will yield better representations for object detection.
We’re thinking: This work opens opportunities to improve all manner of object detection and segmentation subtasks.