In this work-from-home era, who hasn’t spent a video conference wishing they could read an onscreen document without turning their eyes from the person they’re talking with? Or simply hoping the stream wouldn’t stutter or stall? Deep learning can fill in the missing pieces.
What’s new: Maxine is a media streaming platform from Nvidia. It replaces conventional compression-decompression software with neural networks, cutting bandwidth to one-tenth of what H.264 typically requires. It can also enhance resolution to transmit a sharper picture, alter the video image in useful and creative ways, and deliver additional audio and language services.
How it works: Maxine is available to video conference providers through major cloud computing vendors. This video illustrates some of the system’s capabilities. Avaya, which plans to implement some features in its Spaces video conferencing app, is the only customer named so far.
- Rather than transmitting a river of pixels, a user’s computer sends periodic keyframes along with locations of facial keypoints around expressive areas like the eyes, nose, and mouth.
- A generative adversarial network (GAN) synthesizes in-between frames, generating areas around the keypoints. In addition, the GAN can adjust a speaker’s face position and gaze, or map keypoint data onto an animated avatar that speaks in the user’s voice while mimicking the user’s facial expressions. The GAN is trained to work with faces wearing masks, glasses, hats, and headphones.
- Other models manage audio services such as conversational chatbots and noise filtering, as well as language services such as automatic translation and transcription.
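To get a feel for why sending keypoints instead of pixels saves bandwidth, here is a rough back-of-the-envelope sketch. It is not Nvidia's implementation, and every number in it is an assumption chosen for illustration: 68 keypoints per face, 16-bit coordinates, a 50 KB keyframe sent every 10 seconds, and 30 frames per second. The names `StreamConfig`, `pack_keypoints`, and `keypoint_stream_bps` are hypothetical.

```python
import struct
from dataclasses import dataclass


@dataclass
class StreamConfig:
    # All values are illustrative assumptions, not Maxine's actual settings.
    fps: int = 30                      # frames per second
    num_keypoints: int = 68            # facial keypoints sent per frame
    bytes_per_coord: int = 2           # 16-bit unsigned x and y values
    keyframe_bytes: int = 50_000       # size of one compressed keyframe
    keyframe_interval_s: float = 10.0  # seconds between keyframes


def pack_keypoints(points):
    """Serialize (x, y) pixel coordinates as little-endian 16-bit integers."""
    flat = [coord for xy in points for coord in xy]
    return struct.pack(f"<{len(flat)}H", *flat)


def keypoint_stream_bps(cfg):
    """Estimate bits per second for a keyframe-plus-keypoints stream."""
    keypoint_bytes_per_frame = cfg.num_keypoints * 2 * cfg.bytes_per_coord
    keypoint_bps = keypoint_bytes_per_frame * cfg.fps * 8
    keyframe_bps = cfg.keyframe_bytes * 8 / cfg.keyframe_interval_s
    return keypoint_bps + keyframe_bps


if __name__ == "__main__":
    cfg = StreamConfig()
    payload = pack_keypoints([(320, 240)] * cfg.num_keypoints)
    print(f"keypoint payload per frame: {len(payload)} bytes")
    print(f"estimated stream rate: {keypoint_stream_bps(cfg) / 1000:.1f} kbps")
```

Under these assumed values, each frame's keypoints fit in a few hundred bytes, so the stream's cost is dominated by how often keyframes are sent. The receiver's generative model, not the network, does the work of filling in the pixels.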
Why it matters: The volume of video data on the internet was growing exponentially before the pandemic hit, and since then, video conferencing has exploded. Neural networks can reclaim much of that bandwidth and boost quality in the bargain, scaling up the resolution of pixelated imagery, removing extraneous sounds, and providing expressive animated avatars and informative synthetic backgrounds.
We’re thinking: AI is working wonders for signal processing in both video and audio domains. Streaming is great, but also look for GANs to revolutionize image editing and video production.