An object detector trained exclusively on urban images might mistake a moose for a pedestrian and express high confidence in its poor judgment. New work enables object detectors, and potentially other neural networks, to lower their confidence when they encounter unfamiliar inputs.
What’s new: Xuefeng Du and colleagues at University of Wisconsin-Madison proposed Virtual Outlier Synthesis (VOS), a training method that synthesizes representations of outliers to make an object detector more robust to unusual examples.
Key insight: Neural networks that perform classification (including object detectors) learn to divide high-dimensional space into regions that contain different classes of examples. Having populated a region with examples of a given class, they can include nearby empty areas in that region. Then, given an outlier, they’re likely to confidently label it with a class even if all familiar examples are far away. But a model can learn to recognize when low confidence is warranted by giving it synthetic points that fall into those empty areas and training it to distinguish between synthetic and actual points.
How it works: Given an image, an object detector generates two types of outputs: bounding boxes and classifications for those boxes. VOS adds a third: the model’s degree of certainty that the image is an outlier.
- For a batch of training images, the model proposed bounding boxes around regions that should contain objects.
- To synthesize an outlier, VOS looked at the representations at the network’s penultimate layer, then fit a Gaussian distribution to the representations of each class and sampled a representation with low probability. Conceptually, this is like drawing an ellipse around each class, then sampling a point close to the boundary of the ellipse. (They synthesized representations rather than images because it's easier to learn to generate relatively compact vectors than data-rich images.)
- To detect outliers, the authors added a logistic regression layer after the penultimate layer of the network. Given a representation of an image or a synthetic outlier, this layer learned to compute its likelihood of an outlier.
- The loss function consisted of a bounding box regression loss that taught the model to locate objects in an image, a bounding box classification loss that taught it to recognize the objects in the boxes, and an “uncertainty” loss that taught it to recognize certain objects (actually representations) as outliers.
Results: VOS maintained object detectors’ classification performance while reducing its false-positive rate. For instance, a ResNet-50 trained using VOS on a dataset that depicts persons, animals, vehicles, and indoor objects achieved object-detection performance of 88.66 percent AUC with a false-positive rate (FPR95) of 49.02 percent. By comparison, a ResNet-50 trained via a method that used a GAN to generate outlier images achieved slightly lower object-detection performance (83.67 percent AUC) and a much higher false-positive rate (60.93 percent FPR95).
Why it matters: It’s difficult to teach a neural network that the training dataset is just a subset of a diverse world. Moreover, the data distribution can drift between training and inference. VOS tackles the hard problem of encouraging object detectors to exercise doubt about unfamiliar objects without reducing their certainty with respect to familiar ones.
We’re thinking: The typical machine learning model learns about known knowns so it can recognize unknown knowns. While it’s a relief to have a neural network that identifies known unknowns, we look forward to one that can handle unknown unknowns.