3D Scene Synthesis for the Real World Generating 3D scenes with radiance fields and image data

Published

Jun 09, 2021

Reading time

2 min read

Researchers have used neural networks to generate novel views of a 3D scene based on existing pictures plus the positions and angles of the cameras that took them. In practice, though, you may not know the precise camera positions and angles, since location sensors may be unavailable or miscalibrated. A new method synthesizes novel perspectives based on existing views alone.

What’s new: Chen-Hsuan Lin led researchers at Carnegie Mellon University, Massachusetts Institute of Technology, and University of Adelaide in developing the archly named Bundle-Adjusting Neural Radiance Fields (BARF), a technique that generates new 3D views from images of a scene without requiring further information.

Key insight: The earlier method called NeRF requires camera positions and angles to find values that feed a neural network. Those variables can be represented by a learnable vector, and backpropagation can update it as well as the network’s weights.

How it works: Like NeRF, BARF generates views of a scene by sampling points along rays that extend from the camera through each pixel. It uses a vanilla neural network to compute the color and transparency of each point based on the point’s position and the ray’s direction. To determine the color of a given pixel, it combines the color and transparency of all points along the associated ray. Unlike NeRF, BARF’s loss function is designed to learn camera positions and angles, and it uses a training schedule to learn camera viewpoints before pixel colors.

As input, BARF takes images plus their viewpoint vectors. Given a novel viewpoint, it learns to minimize the difference between the predicted and ground-truth color of each pixel.
Points along separate rays that are close to one another have similar coordinates. The similarity makes it difficult to distinguish details and object boundaries in such areas. To work around this issue, BARF (like NeRF) represents points as fixed position vectors such that a small change in a point’s location causes a large change in its position vector.
This positional encoding helps the system reproduce scene details, but it inhibits learning of viewpoint vectors, since a large shift in the representation of nearby points causes the learned camera viewpoint to swing wildly without converging. To solve this problem, BARF zeroes out most of each position vector at the start of training and fills it in progressively as training progresses. Consequently, the network learns the correct camera perspective earlier in training and how to paint details in the scene later.

Results: The researchers compared BARF to NeRF, measuring their ability to generate a novel view based on several views of an everyday scene, where the viewpoints were unknown to BARF and known to NeRF. BARF achieved 21.96 competitive peak signal-to-noise ratio, a measure of the difference between the generated and actual images (higher is better). NeRF achieved 23.25 competitive peak signal-to-noise ratio.

Why it matters: Data collected in the wild rarely are perfect, and bad sensors are one of many reasons why. BARF is part of a new generation of models that don’t assume accurate sensor input, spurring hopes of systems that generalize to real-world conditions.

We’re thinking: In language processing, ELMo kicked off a fad for naming algorithms after Sesame Street characters. Here’s hoping this work doesn’t inspire its own run of names.

Subscribe to The Batch