Guest Speaker: Deepfake method syncs up mouth movements with words.

Published Mar 11, 2020 · 2 min read

Deepfake videos in which one person appears to speak another’s words have appeared in entertainment, advertising, and politics. New research ups the ante for an application that enables new forms of both creative expression and misinformation.

What’s new: Linsen Song with researchers at China’s National Laboratory of Pattern Recognition, SenseTime Research, and Nanyang Technological University produced a model that makes a person on video appear to speak words from a separate audio recording with unprecedented realism. You can see the results in this video.

Key insight: Most people’s mouths move similarly when pronouncing the same words. The model first predicts facial expressions from the audio recording. Then it maps those predictions onto the target speaker’s face.

How it works: The approach works with any target video and source audio. It synthesizes new mouth motions from the audio and maps them onto a model of the target’s face, frame by frame.

The audio-to-expression network learns from talking-head videos to predict facial motions from spoken words.
A portion of the network learns to remove personal quirks from the recorded voices, creating a sort of universal speaking voice. That way, individual vocal idiosyncrasies don’t bias the predicted mouth movements.
Software associated with the FaceWarehouse database of facial expression models extracts features of the target speaker’s face, such as head pose and positions of lips, nose, and eyes. The model generates a 3D mesh combining predicted mouth movements from the source audio with the target face.
In each target video frame, a U-Net architecture replaces the original mouth with a reconstruction based on the FaceWarehouse meshes.
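The mesh-building step above can be illustrated with a minimal sketch. It assumes a bilinear face model, in which a core tensor is combined with one weight vector for identity and another for expression; this is the general style of model associated with FaceWarehouse, but the article does not describe the researchers' exact formulation. The function names, tensor sizes, and random data below are hypothetical and chosen only for illustration.

```python
import numpy as np


def build_mesh(core, identity_w, expression_w):
    """Combine identity and expression weights into one 3D mesh.

    core:         (n_coords, n_id, n_exp) tensor of a bilinear face model
                  (hypothetical layout; n_coords = 3 * number of vertices)
    identity_w:   (n_id,) weights describing whose face it is
    expression_w: (n_exp,) weights describing the facial expression
    Returns an array of shape (n_coords // 3, 3) holding vertex positions.
    """
    flat = np.einsum("vie,i,e->v", core, identity_w, expression_w)
    return flat.reshape(-1, 3)


def retarget_frames(core, target_identity_w, predicted_expressions):
    """Keep the target's identity fixed and vary only the expression.

    predicted_expressions: (n_frames, n_exp) weights, standing in for the
    output of an audio-to-expression network (not implemented here).
    Returns an array of shape (n_frames, n_vertices, 3).
    """
    return np.stack(
        [build_mesh(core, target_identity_w, e) for e in predicted_expressions]
    )


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_vertices, n_id, n_exp, n_frames = 5, 4, 3, 10  # toy sizes, not real ones
    core = rng.normal(size=(3 * n_vertices, n_id, n_exp))
    target_identity = rng.normal(size=n_id)
    audio_driven_expr = rng.normal(size=(n_frames, n_exp))
    meshes = retarget_frames(core, target_identity, audio_driven_expr)
    print(meshes.shape)
```

The point of the sketch is the separation of concerns: the identity weights come from the target video and stay constant, while the expression weights change every frame and come from the source audio.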

Results: To test the model’s effectiveness quantitatively, the researchers evaluated its ability to resynthesize mouth movements from their original audio tracks in a video dataset. The model reduced the error in expression (average distance between landmark features) to 0.65 from a baseline of 0.84. In a qualitative study, viewers judged generated videos to be real 65.8 percent of the time, a high score considering that they judged real videos to be real only 77.2 percent of the time.
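As a rough illustration of the quantitative metric, the sketch below computes an average distance between predicted and ground-truth landmark positions, plus the relative change between two error values. The function names are hypothetical, and the article does not specify the units, landmark set, or normalization the researchers used, so this is one plausible reading of "average distance between landmark features" rather than their exact procedure.

```python
import numpy as np


def mean_landmark_distance(predicted, truth):
    """Average Euclidean distance between matching landmarks.

    predicted, truth: arrays of shape (..., n_landmarks, n_dims), for example
    (n_frames, n_landmarks, 2) for 2D landmarks across a video.
    """
    predicted = np.asarray(predicted, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.linalg.norm(predicted - truth, axis=-1).mean())


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error improves on baseline_error."""
    return (baseline_error - new_error) / baseline_error


if __name__ == "__main__":
    # Two toy landmarks: one predicted exactly, one off by a 3-4-5 triangle.
    pred = [[0.0, 0.0], [3.0, 4.0]]
    true = [[0.0, 0.0], [0.0, 0.0]]
    print(mean_landmark_distance(pred, true))
    # Error values reported in the article.
    print(round(relative_reduction(0.84, 0.65), 3))
```

Applied to the reported figures, the drop from 0.84 to 0.65 is a relative reduction of roughly 23 percent.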

Why it matters: Putting new words in a talking head’s mouth is getting easier. While previous approaches often impose prohibitive requirements for training, this method requires only a few minutes of video and audio data. Meanwhile, the results are becoming more realistic, lending urgency to the need for robust detection methods and clear rules governing the distribution of such videos.

We’re thinking: Let’s get this out of the way: We never said it!
