Visual robots typically perceive the three-dimensional world through sequences of two-dimensional images, but they don’t always know what they’re looking at. For instance, Tesla’s self-driving system has been known to mistake a full moon for a traffic light. New research aims to clear up such confusion.
What's new: Aljaž Božič and colleagues at Technical University of Munich released TransformerFusion, which set a new state of the art in deriving 3D scenes from 2D video.
Key insight: The authors combined two architectures with a novel approach to estimating the positions of points in space:
- Transformers excel at learning which features are most relevant to completing a particular task. In a video, they can learn which frames, and which parts of a frame, reveal the role of a given point in space: whether it’s empty or filled by an object.
- However, while transformers do well at selecting the best views, they’re not great at identifying points in space. The authors addressed this shortfall by refining the representations using 3D convolutional neural networks.
- To position the points in space, they generated representations at both coarse and fine scales. The coarse representations enabled the system to place points coherently across relatively large distances, while the fine representations enabled the system to reproduce details.
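The first insight — attention selecting the most informative views of a point — can be illustrated with a minimal NumPy sketch. This is not the authors' code; the function and variable names (`fuse_views`, `point_query`, `view_features`) are illustrative stand-ins for a learned per-point query attending over per-frame features.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def fuse_views(point_query, view_features):
    """point_query: (d,) hypothetical learned query for one 3D point.
    view_features: (n_views, d) per-frame features projected to that point.
    Returns an attention-weighted fusion, so the frames (and frame regions)
    most relevant to that point dominate the fused representation."""
    scores = view_features @ point_query / np.sqrt(point_query.size)
    weights = softmax(scores)       # which views matter most for this point
    return weights @ view_features  # (d,) fused representation

rng = np.random.default_rng(0)
fused = fuse_views(rng.normal(size=8), rng.normal(size=(5, 8)))
```

In the full system, such fused per-point features are what the downstream 3D convolutional networks then refine.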
How it works: Given a series of 2D frames, TransformerFusion learned to reconstruct the 3D space they depicted by classifying whether each 3D pixel, or voxel, lay on (or very close to) an object's surface. The authors trained the system on ScanNet, a dataset that contains RGB-D (video plus depth) clips shot in indoor settings like bedrooms, offices, and libraries; object segmentations; and 3D scene reconstructions.
- Given a 2D frame, a ResNet-18 pretrained on ImageNet produced a coarse representation and a fine representation.
- A transformer mapped the coarse representations, along with information derived from the images such as viewing direction, to a 3D grid with 30-centimeter resolution and produced a new representation for each point. A second transformer mapped fine representations and other information to a 3D grid with 10-centimeter resolution and, likewise, produced a new representation for each point.
- Given the coarse representations, a 3D convolutional neural network learned to classify whether a point was near an object’s surface and refined the representations accordingly. If the point was near a surface, the system continued; otherwise, it classified the point as not near a surface and stopped to save computation.
- A second 3D CNN used both fine and coarse representations to learn how to classify, refine, and filter the representations of points on the fine grid.
- The system interpolated the remaining coarse and fine representations onto a 3D grid of even higher resolution (2 centimeters) and generated another set of point representations. Given each new point, a vanilla neural network classified whether there was an object at that point.
- The authors trained the system using three loss terms: one that encouraged the coarse CNN’s classifications to match ground truth, a similar one for the fine CNN, and a similar one for the higher-resolution CNN.
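The coarse-to-fine scheme and the three-term loss described above can be sketched as follows. This is a toy NumPy illustration under stated assumptions, not the paper's implementation: the probability arrays stand in for the CNN outputs at each resolution, and the pruning threshold of 0.5 is a hypothetical choice.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Binary cross-entropy between predicted occupancy and ground truth."""
    pred = np.clip(pred, eps, 1 - eps)
    return -(target * np.log(pred) + (1 - target) * np.log(1 - pred)).mean()

# Stand-ins for predicted near-surface probabilities at the coarse (30 cm) scale.
coarse_pred = np.array([0.9, 0.1, 0.8, 0.05])
coarse_gt   = np.array([1.0, 0.0, 1.0, 0.0])

# Prune: voxels classified as far from any surface are dropped, so the
# finer stages never process them -- this is what saves computation.
keep = coarse_pred > 0.5

# Only the surviving voxels are refined at the fine (10 cm) and
# higher-resolution (2 cm) stages.
fine_pred = np.array([0.7, 0.95])
fine_gt   = np.array([1.0, 1.0])
high_pred = np.array([0.85, 0.9])
high_gt   = np.array([1.0, 1.0])

# One loss term per stage, each matching that stage's output to ground truth.
loss = bce(coarse_pred, coarse_gt) + bce(fine_pred, fine_gt) + bce(high_pred, high_gt)
```

The key design choice mirrored here is that classification happens at every scale, so supervision reaches each stage directly rather than only through the final output.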
Results: The authors measured distances between TransformerFusion's estimated points in space and ground truth, counting an estimated point as correct if it fell within 5 centimeters of ground truth. The system achieved an F1 score, a balance of precision and recall where higher is better, of 0.655. The best competing method, Atlas, achieved 0.636. Without the 3D CNNs, TransformerFusion achieved 0.361.
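The metric works roughly as follows (a hedged sketch, not the official evaluation script): precision is the fraction of estimated points within 5 centimeters of some ground-truth point, recall is the fraction of ground-truth points within 5 centimeters of some estimate, and F1 is their harmonic mean.

```python
import numpy as np

def f1_at_threshold(est, gt, thresh=0.05):
    """est, gt: (n, 3) arrays of 3D points in meters.
    Returns the F1 score at the given distance threshold."""
    # Pairwise distances between every estimated and ground-truth point.
    d = np.linalg.norm(est[:, None, :] - gt[None, :, :], axis=-1)
    precision = (d.min(axis=1) < thresh).mean()  # estimates near ground truth
    recall = (d.min(axis=0) < thresh).mean()     # ground truth near estimates
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
est = np.array([[0.01, 0.0, 0.0], [2.0, 0.0, 0.0]])
score = f1_at_threshold(est, gt)  # precision 0.5, recall 0.5 -> F1 0.5
```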
Yes, but: Despite setting a new state of the art, TransformerFusion’s ability to visualize 3D scenes falls far short of human-level performance. Its scene reconstructions are distorted, and it has trouble recognizing transparent objects.
Why it matters: Transformers have gone from strength to strength — in language, 2D vision, molecular biology, and other areas — and this work shows their utility in a new domain. Yet, despite their capabilities, they can’t do the whole job. The authors took advantage of transformers where they could do well and then refined their output using an architecture more appropriate to 3D modeling.
We're thinking: Training systems on both low- and high-resolution versions of an image could improve other vision tasks as well.