If Mona Lisa Could Talk

Published: May 29, 2019

The art of video fakery leaped forward with a system that produces lifelike talking-head videos based on a single still portrait.

What’s new: Deepfake videos typically are created by feeding thousands of images of a person to a neural network. Egor Zakharov and his colleagues at Samsung and Skolkovo Institute of Science and Technology devised a method that needs only a handful — even just one — to generate good results.

How it works: The system consists of three networks, all of them pre-trained on thousands of talking-head videos of diverse speakers:

The embedder maps video frames and facial landmarks to vectors that capture characteristics of a variety of faces.
The generator maps input face landmarks from different portions of the same data set onto output frames. It uses both trained weights and drop-in weights derived from the embedder outputs.
The discriminator and generator engage in an adversarial process to learn how to produce images of greater realism. In addition, the system minimizes the differences between original and synthesized images to preserve the speaker's identity.
The resulting meta-learning setup learns to combine source and target landmarks in a single image.
The generator and discriminator are fine-tuned on source and target images to create output that's more true to the source.
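
To make the "drop-in weights" idea concrete, here is a minimal sketch in plain Python. It assumes, for illustration only, that the embedder's per-frame vectors are averaged into one identity vector, that a linear projection turns that vector into a scale and a shift, and that the generator applies them to a normalized feature channel. The function names, the projection matrix, and the toy numbers are hypothetical and are not taken from the researchers' code.

```python
from statistics import fmean


def average_embeddings(frame_embeddings):
    """Average per-frame embedding vectors into one identity vector."""
    return [fmean(dim) for dim in zip(*frame_embeddings)]


def project(embedding, matrix):
    """Linear projection: one output value per row of `matrix`."""
    return [sum(w * e for w, e in zip(row, embedding)) for row in matrix]


def modulate(features, scale, shift, eps=1e-5):
    """Normalize one feature channel, then apply a person-specific scale and shift."""
    mean = fmean(features)
    var = fmean((x - mean) ** 2 for x in features)
    std = (var + eps) ** 0.5
    return [scale * (x - mean) / std + shift for x in features]


# Toy example: two frames of one speaker, each embedded as a 2-D vector (assumed values).
identity = average_embeddings([[1.0, 2.0], [3.0, 4.0]])

# Hypothetical projection matrix; in a real system these weights would be learned.
scale, shift = project(identity, [[1.0, 0.0], [0.0, 1.0]])

# The generator's own trained layers produce `features`; the drop-in values restyle them.
styled = modulate([0.0, 1.0, 2.0], scale, shift)
print(identity, scale, shift, styled)
```

The point of the sketch is the division of labor: the generator's trained weights stay fixed across speakers, while the values derived from the embedder change with each new face.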

Results: Test subjects were asked to rate the system’s output as fake or real. When the system had been fine-tuned on just one image, they were fooled a considerable amount of the time. When it had been fine-tuned on 32 photos, their guesses were no better than chance. See for yourself in this video.

Why it matters: The new technique drastically cuts the time, computation, and data needed to produce lifelike video avatars. The researchers suggest using it to create doppelgängers for gaming and videoconferencing — that is, using imagery of a single person both to generate the talking head and drive its behavior.

The fun stuff: The most dramatic results arose from animating faces from iconic paintings, including the Mona Lisa. How wonderful it would be to see her come to life and tell her story! Applications in entertainment and communications are as intriguing as the potential for abuse is worrisome.

We’re thinking: The ability to make convincing talking-head videos from a few snapshots, combined with voice cloning tech that generates realistic voices from short audio snippets, portends much mischief. We're counting on the AI community to redouble its efforts to build countermeasures that spot fakes.
