Alexa, Read My Lips

Amazon Alexa uses visual clues to determine who is talking.

Published Oct 07, 2020

Amazon’s digital assistant is using its eyes as well as its ears to figure out who’s talking.

What’s new: At its annual hardware showcase, Amazon introduced an Alexa skill that melds acoustic, linguistic, and visual cues to help the system keep track of individual speakers and topics of conversation. Called natural turn-taking, the skill should be available next year.

How it works: Natural turn-taking fuses analyses of data from the microphone and camera in devices like the Echo Show, Echo Look, and Echo Spot.

To determine whether a user is speaking to Alexa, the system passes photos of the speaker through a pose detection algorithm to see which way they’re facing. It also passes the voice recording through an LSTM that extracts features and a speech recognition model to decide whether the words were directed at the device. It fuses the models’ outputs to make a determination.
The new skill also makes Alexa more responsive to interruptions. For instance, if a user asks for a Bruce Springsteen song and then says, “Play Charlie Parker instead,” Alexa can pivot from the Boss to the Bird.
The skill understands indirect requests, like when a user butts in with “That one” while Alexa is reading a list of take-out restaurants. The system time-stamps such interruptions to figure out what the user was referring to, then passes that information to a dialogue manager model to formulate a response.
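As a rough illustration of two of the steps above, the sketch below fuses a "facing the device" score with a "speech is device-directed" score, and maps an interruption's time stamp back to the list item being read. It is a minimal sketch under stated assumptions, not Amazon's implementation: the article does not say how the model outputs are fused, so the weighted average, the threshold, the reaction delay, and every name here (`device_directed`, `SpokenItem`, `resolve_reference`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


def device_directed(p_facing: float, p_speech: float,
                    w_visual: float = 0.4, threshold: float = 0.5) -> bool:
    """Late fusion of two model confidences, each in [0, 1].

    ASSUMPTION: a weighted average with a fixed threshold. The source only
    says the models' outputs are fused, not how.
    """
    fused = w_visual * p_facing + (1.0 - w_visual) * p_speech
    return fused >= threshold


@dataclass
class SpokenItem:
    text: str     # what the assistant read aloud
    start: float  # seconds since it began reading the list
    end: float    # seconds at which it finished this item


def resolve_reference(items: Sequence[SpokenItem], t_interrupt: float,
                      reaction_delay: float = 0.3) -> Optional[SpokenItem]:
    """Pick the item a user most likely meant by "That one".

    ASSUMPTION: items are sorted by start time, and the user reacts about
    `reaction_delay` seconds after hearing the item they want.
    """
    t = max(0.0, t_interrupt - reaction_delay)
    chosen = None
    for item in items:
        if item.start <= t:
            chosen = item  # latest item that had started by time t
        else:
            break
    return chosen
```

With a hypothetical list read as three items of 1.5 seconds each, an interruption at 3.2 seconds resolves to the second item rather than the third, because the reaction delay shifts the reference point back to 2.9 seconds. The resolved item would then be handed to whatever component formulates the response.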

Why it matters: In conversation, people interrupt, talk over one another, and rarely use each other’s names. Making conversational interactions with AI more fluid could be handy in a wide variety of settings.

We’re thinking: Alexa now tolerates users interrupting it. Will users eventually tolerate Alexa interrupting them?
