The University of Hong Kong, and Cornell University to create Neural Body, a procedure that generates novel views of a single human character based on shots from only a few angles.

Key insight: An earlier approach called NeRF extracted a 3D model from images taken by as few as 16 still cameras, which could be used to synthesize an image from a novel angle. The authors took a similar approach but aggregated information not only from different angles but also across the associated video frames. This enabled their system to match an actor’s pose from any angle, across successive frames, based on input from four cameras.

How it works: Neural Body creates a 3D model, poses it, and determines the colors to render from any viewpoint. The authors assembled a dataset of nine scenes shot from 21 angles. To synthesize a fresh angle on a particular scene, they trained the system on four angles chosen at random and tested it on the rest.

  • Given clips of a scene shot from four angles, the authors preprocessed the video frames to extract the human figure and remove the background. Then, for each frame, they used Total Capture to pose a deformable human model to match the image. This process generated a mesh model. They assigned a trainable vector to each vertex in the mesh.
  • SparseConvNet, a convolutional neural net specialized for 3D point clouds, learned to map (or, in the authors’ word, diffuse) the vertex vectors to a separate set of vectors for nearby positions on a 3D grid.
  • To determine the color of each pixel from a given viewing angle, the authors traced a ray from the camera through the pixel. At evenly spaced locations along the ray, they calculated representations based on the grid vectors. Given these representations, the locations along the ray, and the viewing angle, two fully connected networks predicted the parameters needed to compute the color. Given those parameters, the volume rendering integral yielded the pixel’s color. They repeated this process for all pixels.
  • The vertex representations, the SparseConvNet, and the two fully connected networks were trained together to minimize differences between predicted and actual images for all four videos.
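
The rendering step in the third bullet can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes, as in NeRF, that the predicted parameters are a volume density and an RGB color at each sample along the ray, and it approximates the volume rendering integral with the usual discrete sum over evenly spaced samples. The function name composite_ray and its arguments are hypothetical.

```python
import math

def composite_ray(densities, colors, step):
    """Approximate the volume rendering integral along one ray.

    Assumption (not stated in the article): each sample carries a
    non-negative volume density and an (r, g, b) color in [0, 1].
    densities and colors are ordered from the camera outward, and
    step is the distance between evenly spaced samples.
    """
    transmittance = 1.0  # fraction of light that has not yet been absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb in zip(densities, colors):
        # Opacity contributed by the segment around this sample.
        alpha = 1.0 - math.exp(-sigma * step)
        weight = transmittance * alpha
        for k in range(3):
            pixel[k] += weight * rgb[k]
        # Light surviving past this segment.
        transmittance *= 1.0 - alpha
    return tuple(pixel), transmittance

# Hypothetical usage: a faint red sample in front of a dense green one.
color, remaining = composite_ray(
    densities=[0.5, 50.0],
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    step=0.1,
)
```

Samples near the camera get first claim on the light, so a dense sample hides everything behind it. The leftover transmittance indicates how much of the ray passed through without hitting the figure.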

Results: Given a frame from the training set and one of the 17 angles on which the system didn’t train, the authors compared the images generated by Neural Body to the actual images. They measured the peak signal-to-noise ratio, a gauge of how well a generated image reproduces the original (higher is better). Neural Body achieved an average peak signal-to-noise ratio of 27.87, compared to NeRF’s 19.63.
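
For readers unfamiliar with the metric, here is a minimal sketch of how peak signal-to-noise ratio is commonly computed. It assumes pixel values scaled to the range 0 to 1 and images flattened into equal-length lists; the function name psnr is illustrative, and the authors' exact evaluation code may differ.

```python
import math

def psnr(predicted, actual, max_value=1.0):
    """Peak signal-to-noise ratio, in decibels, between two images.

    predicted and actual are flat sequences of pixel values of the
    same length; max_value is the largest possible pixel value
    (assumed here to be 1.0).
    """
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("images must be non-empty and the same size")
    # Mean squared error over all pixel values.
    mse = sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(predicted)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)

# Hypothetical usage: every pixel is off by 0.1.
score = psnr([0.5, 0.5, 0.5], [0.4, 0.4, 0.4])
```

Because the scale is logarithmic, each gain of 10 corresponds to a tenfold reduction in mean squared error.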

Yes, but: The system produces only the character’s image. In practical use, a filmmaker would need to composite the character into a scene.

Why it matters: Models don’t always use available information efficiently during training. By integrating across video frames, rather than simply integrating different camera angles at the same moment in time, Neural Body is able to take advantage of all the information available to it.

We’re thinking: While shooting the Deep Learning Specialization, we tried an obtuse angle, but it was never right.

