Computer Vision Transformed: Facebook's Detection Transformer (DETR) for object detection

Published

Jul 08, 2020

The transformer architecture that has shaken up natural language processing may replace recurrent layers in object detection networks.

What’s new: A Facebook team led by Nicolas Carion and Francisco Massa simplified object detection pipelines by using transformers, yielding Detection Transformer (DETR).

Key insight: Images can show multiple objects. Some object detection networks use recurrent layers to predict one object at a time until all objects are accounted for. Language models use transformers to evaluate a sequence of words in one pass. Similarly, DETR uses them to predict all objects in an image in a single process.

How it works: DETR predicts a fixed number of object bounding boxes and classes per image. First, it extracts image features using convolutional layers. Then transformers predict features associated with regions likely to contain objects. Feed-forward layers process the object features into classes and bounding boxes. (“No object” is a possible class.)
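
To make the fixed-size output concrete, here is a minimal sketch of how such predictions could be decoded. It is not DETR's actual code: the class list, the `decode_predictions` helper, the (cx, cy, w, h) box format, and all numbers are assumptions made up for illustration. The point is only that every image yields the same number of slots, and slots whose top class is “no object” are dropped.

```python
# Hypothetical class list for illustration; the last index stands for "no object".
CLASSES = ["cat", "dog", "no object"]
NO_OBJECT = len(CLASSES) - 1


def decode_predictions(class_probs, boxes):
    """Keep only the slots whose most likely class is a real object.

    class_probs: one probability list per prediction slot.
    boxes: one (cx, cy, w, h) tuple per slot, in relative image coordinates
           (an assumed box format for this sketch).
    """
    detections = []
    for probs, box in zip(class_probs, boxes):
        best = max(range(len(probs)), key=lambda i: probs[i])
        if best != NO_OBJECT:
            detections.append((CLASSES[best], probs[best], box))
    return detections


# Four fixed slots; the made-up model assigns two of them to "no object".
class_probs = [
    [0.90, 0.05, 0.05],
    [0.10, 0.10, 0.80],
    [0.20, 0.70, 0.10],
    [0.05, 0.05, 0.90],
]
boxes = [
    (0.30, 0.40, 0.20, 0.30),
    (0.50, 0.50, 0.10, 0.10),
    (0.70, 0.60, 0.25, 0.35),
    (0.10, 0.90, 0.05, 0.05),
]

detections = decode_predictions(class_probs, boxes)
```

Because the slot count is fixed, the model needs no separate step to decide how many objects to emit; it only has to label the spare slots “no object.”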

The transformers generate object bounding boxes and labels as a sequence, but their order is arbitrary.
The loss function uses the Hungarian algorithm to match each ground-truth object with a unique prediction. Predictions left unmatched are trained to output “no object.” This makes anchors (predefined reference boxes) and complicated matching algorithms unnecessary.
During training, each transformer decoder layer makes its own prediction. Applying the loss to each of these outputs encourages every layer to contribute, a technique borrowed from language models that’s not available with recurrent layers. This additional loss especially helps the system predict the correct number of objects.
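
The one-to-one matching step can be sketched in a few lines. This is a toy under stated assumptions, not the researchers' implementation: it swaps the Hungarian algorithm for a brute-force search over permutations (which finds the same minimum-cost assignment, only far more slowly), and the cost function, helper names, and numbers are hypothetical simplifications.

```python
from itertools import permutations


def box_l1(a, b):
    """L1 distance between two boxes given as (cx, cy, w, h) tuples."""
    return sum(abs(x - y) for x, y in zip(a, b))


def match_cost(pred, target):
    """Toy cost: low when the prediction favors the target class and the boxes agree."""
    probs, box = pred
    label, target_box = target
    return -probs[label] + box_l1(box, target_box)


def best_matching(preds, targets):
    """Return (prediction index per target, total cost) for the cheapest one-to-one matching.

    Brute force over permutations; practical for a handful of slots only.
    """
    best_assignment, best_total = None, float("inf")
    for slots in permutations(range(len(preds)), len(targets)):
        total = sum(match_cost(preds[s], targets[t]) for t, s in enumerate(slots))
        if total < best_total:
            best_assignment, best_total = slots, total
    return best_assignment, best_total


# Three prediction slots: (class probabilities for [cat, dog, no object], box).
preds = [
    ([0.1, 0.8, 0.1], (0.70, 0.60, 0.25, 0.35)),
    ([0.1, 0.1, 0.8], (0.50, 0.50, 0.10, 0.10)),
    ([0.9, 0.05, 0.05], (0.30, 0.40, 0.20, 0.30)),
]
# Two ground-truth objects: (class index, box).
targets = [
    (0, (0.32, 0.41, 0.20, 0.28)),  # cat
    (1, (0.68, 0.62, 0.26, 0.33)),  # dog
]

assignment, total = best_matching(preds, targets)
# Slots left unmatched would be trained toward "no object".
unmatched = [i for i in range(len(preds)) if i not in assignment]
```

Here the cat is matched to slot 2 and the dog to slot 0, regardless of the order in which the slots were emitted, and slot 1 is left for “no object.” That order-independence is why the arbitrary sequence order does not hurt training.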

Results: The researchers pitted DETR against Faster R-CNN on the canonical object detection dataset COCO. At model sizes of roughly 40 million parameters, DETR achieved an average precision (a metric that summarizes precision across recall levels) of 0.420, compared to Faster R-CNN’s 0.402. And DETR did it faster, spotting objects at 28 images per second compared to Faster R-CNN’s 26 images per second.

Why it matters: Transformers are changing the way machine learning models handle sequential data in NLP and beyond.

We’re thinking: What happened to the Muppet names for transformer-based models? Fozzie Bear is available.
