The latest update of the acclaimed real-time object detector You Only Look Once is more accurate than ever.
What’s new: Alexey Bochkovskiy, Chien-Yao Wang, and Hong-Yuan Mark Liao at the Institute of Information Science, Academia Sinica, in Taiwan offer YOLOv4, the first version not to include the architecture’s original creators.
Key insight: Rapid inference is YOLO’s claim to fame. The authors prioritized newer techniques that improve accuracy without impinging on speed (their so-called “bag of freebies”). In addition, they made improvements that boost accuracy at a minimal cost to speed (the “bag of specials”). All told, these tweaks enable the new version to outperform both its predecessor and high-accuracy competitors running at real-time frame rates.
How it works: YOLO, like most object detectors since, tacks a model that predicts bounding boxes and classes onto a feature extractor pre-trained on ImageNet.
- Techniques under the heading “bag of freebies” boost accuracy by adding computation only during training, so inference cost is unchanged. These include alternate bounding box loss functions, data augmentation, and decreasing the model’s confidence for ambiguous classes (label smoothing).
- The authors introduce new data augmentation techniques such as Mosaic, which mixes elements drawn from four training images to place objects in novel contexts.
- “Bag of specials” techniques include the choice of activation function: ReLU variants are marginally slower than plain ReLU, but they can yield better accuracy.
- The authors accommodate users with limited hardware resources by choosing techniques that allow training on a single, reasonably affordable GPU.
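Three of the techniques listed above are simple enough to sketch in a few lines of NumPy. What follows is a minimal illustration, not the authors’ implementation: the function names, the smoothing factor of 0.1, and the 64-pixel canvas are assumptions chosen for the example. The Mosaic version is heavily simplified; it only tiles four images around a random center and leaves out the bounding-box bookkeeping a real detector would need. Mish stands in for the activation function, since the YOLOv4 paper names it among its choices.

```python
import numpy as np


def smooth_labels(one_hot, eps=0.1):
    """Label smoothing: shift eps of the probability mass from the
    true class to a uniform distribution over all classes."""
    num_classes = one_hot.shape[-1]
    return one_hot * (1.0 - eps) + eps / num_classes


def mish(x):
    """Mish activation: x * tanh(softplus(x)).
    np.logaddexp(0, x) computes log(1 + exp(x)) in a numerically stable way."""
    return x * np.tanh(np.logaddexp(0.0, x))


def mosaic(images, out_size=64, rng=None):
    """Simplified Mosaic: tile four images around a random center point.
    Assumes each image is at least out_size x out_size x 3.
    Bounding-box adjustment is omitted in this sketch."""
    if len(images) != 4:
        raise ValueError("mosaic expects exactly four images")
    rng = np.random.default_rng() if rng is None else rng
    # Pick a center away from the borders so every tile is non-empty.
    cx = int(rng.integers(out_size // 4, 3 * out_size // 4))
    cy = int(rng.integers(out_size // 4, 3 * out_size // 4))
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    # (y0, y1, x0, x1) for top-left, top-right, bottom-left, bottom-right.
    regions = [
        (0, cy, 0, cx),
        (0, cy, cx, out_size),
        (cy, out_size, 0, cx),
        (cy, out_size, cx, out_size),
    ]
    for img, (y0, y1, x0, x1) in zip(images, regions):
        canvas[y0:y1, x0:x1] = img[: y1 - y0, : x1 - x0]
    return canvas
```

Label smoothing and Mosaic run only at training time, which is why they count as freebies. Mish runs at inference too, so it belongs in the bag of specials.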
Results: The authors pitted YOLOv4 against other object detectors that process at least 30 frames per second, using the COCO image dataset. YOLOv4 achieved 0.435 average precision (AP), running at 62 frames per second (FPS). It achieved 0.41 AP at its maximum rate of 96 FPS. The previous state of the art, EfficientDet, achieved 0.43 AP running at nearly 42 FPS and 0.333 AP at its top speed of 62 FPS.
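To make the speed-accuracy comparison concrete, here is a small sketch that tabulates the four figures reported above and picks the most accurate entry at or above a given frame rate. The entry labels and the helper name `best_at_or_above` are made up for this example, and “nearly 42 FPS” is rounded to 42.

```python
# (label, AP, FPS) taken from the results paragraph; labels are illustrative.
results = [
    ("YOLOv4", 0.435, 62),
    ("YOLOv4 (max speed)", 0.41, 96),
    ("EfficientDet", 0.43, 42),  # reported as "nearly 42 FPS"
    ("EfficientDet (max speed)", 0.333, 62),
]


def best_at_or_above(entries, min_fps):
    """Return the highest-AP entry that runs at min_fps or faster, or None."""
    eligible = [e for e in entries if e[2] >= min_fps]
    return max(eligible, key=lambda e: e[1]) if eligible else None


# AP gap between the two detectors when both run at 62 FPS.
gap_at_62_fps = results[0][1] - results[3][1]
```

With these numbers, `best_at_or_above(results, 30)` and `best_at_or_above(results, 62)` both select the YOLOv4 entry at 0.435 AP, and the gap at 62 FPS is about 0.10 AP.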
Why it matters: YOLOv4 locates and classifies objects faster than measurements of human performance. While it’s not as accurate as slower networks such as the largest EfficientDet variants, the new version boosts accuracy without sacrificing speed.
We’re thinking: You only look once . . . twice . . . thrice . . . four times and counting!