Another week, another way to make deepfakes. A team at Samsung recently proposed a model that generates talking-head videos by imposing facial landmarks over still images. Now a different team offers one that makes an onscreen speaker say anything you can type.
What’s new: Ohad Fried and fellow researchers at Stanford, Adobe, and Princeton unveiled a system that morphs talking-head videos to match any script. Simply type in new words, and the speaker delivers them flawlessly. Check out the video.
How it works: Given a video and a new script, the model identifies where to make edits and finds syllables in the recording that it can patch together to render the new words. Then it reconstructs the face. The trick is to match mouth and neck movements with desired verbal edits while maintaining consistent background and pose:
- Face tracking obtains facial properties — orientation, shape, and expression — as well as environmental characteristics like lighting.
- Given new text, the system searches over a transcript of the original recording for phonemes useful in constructing the new words. Then it extracts the corresponding frames and facial features and substitutes them for the originals.
- Extracted frames can change both the background and the speaker’s pose. The system mitigates such artifacts by further swapping in nearest neighbors from the original video around the edit location.
- In parallel with these modifications, the system interpolates extracted facial features with original features at the edit location to produce a smoother sequence.
- A recurrent GAN replaces mouth and neck movements in the nearest-neighbor images with those matching the interpolated facial parameters.
- For now, the system edits video only. To fill in the soundtrack, the researchers either re-recorded the speaker’s voice or synthesized a new one. But there may be better options.
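To make the phoneme search and feature blending steps above concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the researchers' implementation: the greedy longest-match search stands in for whatever search the system actually uses, the linear blend stands in for its interpolation, and the names `find_phoneme_span`, `cover_with_snippets`, and `blend` are hypothetical.

```python
from typing import List, Optional, Tuple

Span = Tuple[int, int]  # (start, end) indices into the transcript, end exclusive


def find_phoneme_span(transcript: List[str], query: List[str]) -> Optional[Span]:
    """Return the first exact occurrence of `query` in `transcript`, or None."""
    n, m = len(transcript), len(query)
    for start in range(n - m + 1):
        if transcript[start:start + m] == query:
            return start, start + m
    return None


def cover_with_snippets(transcript: List[str], target: List[str]) -> List[Span]:
    """Cover `target` with spans of the recording, longest match first.

    Assumption: a greedy search is good enough for illustration; a real
    system would likely also score how well neighboring snippets fit.
    """
    spans: List[Span] = []
    i = 0
    while i < len(target):
        hit = None
        # Try the longest remaining piece first, then shrink it.
        for length in range(len(target) - i, 0, -1):
            hit = find_phoneme_span(transcript, target[i:i + length])
            if hit is not None:
                break
        if hit is None:
            raise ValueError(f"phoneme {target[i]!r} never occurs in the recording")
        spans.append(hit)
        i += hit[1] - hit[0]
    return spans


def blend(original: List[float], retrieved: List[float], weight: float) -> List[float]:
    """Linearly mix two facial-parameter vectors (weight=0 keeps the original)."""
    return [(1 - weight) * o + weight * r for o, r in zip(original, retrieved)]


# Phonemes of a recording that says "the cat sat"; the new word is "sack".
recording = ["DH", "AH", "K", "AE", "T", "S", "AE", "T"]
new_word = ["S", "AE", "K"]
snippets = cover_with_snippets(recording, new_word)  # [(5, 7), (2, 3)]
smoothed = blend([0.0, 1.0], [1.0, 0.0], 0.25)       # [0.25, 0.75]
```

Each span maps to a run of frames in the original video. Blending the retrieved facial parameters toward the originals near the seams is what keeps the stitched sequence from jumping.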
Results: Fried’s new technique allows more flexible editing and creates more detailed reconstructions than earlier methods. Test subjects barely noticed the editing.
Limitations: The approach reshapes only the mouth and neck region (and the occasional hand entering that region). That leaves room for expressive inconsistencies like a deadpan face when the new script calls for surprise.
Why it matters: Video producers will love this technology. It lets them revise and fix mistakes without the drudgery of re-recording or manually blending frames. Imagine a video lecture tailored to your understanding and vocabulary. Producing a slightly different lecture for each student would be a monumental task for human beings, but it’s a snap for AI.
We’re thinking: It goes without saying that such technology is ripe for abuse. Fried recommends that anyone using the technology disclose that fact up front. We concur but hold little optimism that fakers will comply.