Computer vision models typically draw bounding boxes around objects they spot, but those rectangles are a crude approximation of an object’s outline. A new method finds keypoints on an object’s perimeter to produce state-of-the-art object detection.
What’s new: Ze Yang and researchers from Peking University, Tsinghua University, and Microsoft Research developed a network, RPDet, that extracts what the authors call representation points, or RepPoints.
Key insights: Bounding boxes can be constructed from RepPoints, which enables RPDet to learn to derive RepPoints from bounding-box labels in standard object-recognition datasets. A good RepPoint is one that helps to answer two questions: What is the bounding box, and what object does it enclose?
How it works: RPDet uses a feature pyramid network that extracts a hierarchy of image features at varying levels of detail. From these features, it extracts a user-defined number of points as follows:
- The model starts by identifying the center point.
- It infers the remaining points from that one using deformable convolutions. Typical convolutions learn only weights and sample their input on a fixed rectangular grid, which suits rectangular bounding boxes. Deformable convolutions learn offsets as well. The offsets define a custom sampling shape, as opposed to the usual grid.
- The model constructs a bounding box around the RepPoints by finding the smallest box that contains all points. RPDet is trained via backpropagation to match bounding box corners in the training data.
- Having located objects by finding their RepPoints, RPDet classifies the objects. This additional task encourages RPDet to identify important locations on an object and avoid fixating on bounding-box corners.
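The geometric part of these steps can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the example coordinates, and the plain L1 corner loss are assumptions chosen for illustration, and in RPDet the offsets would come from learned deformable convolutions rather than a hand-written list.

```python
import numpy as np

def points_from_offsets(center, offsets):
    """Place points at center + (dx, dy) offsets (offsets are hypothetical here)."""
    center = np.asarray(center, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    return center + offsets  # shape: (n_points, 2)

def min_max_box(points):
    """Smallest axis-aligned box (x1, y1, x2, y2) that contains all points."""
    points = np.asarray(points, dtype=float)
    x1, y1 = points.min(axis=0)
    x2, y2 = points.max(axis=0)
    return np.array([x1, y1, x2, y2])

def corner_l1_loss(pred_box, gt_box):
    """Mean absolute error between box corners (an assumed, simplified loss)."""
    pred_box = np.asarray(pred_box, dtype=float)
    gt_box = np.asarray(gt_box, dtype=float)
    return float(np.abs(pred_box - gt_box).mean())

# Illustrative values only: a center point plus five offsets.
center = (50.0, 40.0)
offsets = [(0, 0), (-10, -5), (12, 3), (-4, 8), (6, -9)]

points = points_from_offsets(center, offsets)
box = min_max_box(points)
loss = corner_l1_loss(box, [38.0, 30.0, 63.0, 49.0])
```

Because the box is computed from the points with min and max operations, an error measured at the box corners can flow back to the points that define them, which is how bounding-box labels alone can supervise point locations.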
Results: Processing image features supplied by a ResNet, RPDet achieved a 2 percent boost in classification accuracy over bounding-box representations. Further, RPDet set a new state of the art on COCO, an object detection and classification dataset, with a 4 percent improvement in average precision over the alternatives considered.
Why it matters: This technique encodes relatively detailed information about object shapes that could be useful in a variety of tasks. For instance, RepPoints’ implicit estimation of poses could help predict the trajectory of a moving object.
We’re thinking: Plenty of applications, including face recognition, find explicit predefined keypoints. But they tend to be specialized for specific types of objects, such as finding the eyes, nose, and mouth on faces. RepPoints encode arbitrary geometry and pose information for a wide range of shapes, giving them a potential role in applications that otherwise wouldn’t be feasible.