10-Second Crystal Ball

Published

Jun 05, 2019

Automated image recognition raises an ethical challenge: Can we take advantage of this useful technology without impinging on personal privacy and autonomy? That question becomes more acute with new research that uses imagery to predict human actions.

What’s new: Researchers combined several video processing models to create a state-of-the-art pipeline that predicts not only where pedestrians will go, but also what they’ll do next, within a time horizon of several seconds. Predicting actions, they found, improves the system’s trajectory predictions. The researchers also provide a video of the system in action.

How it works: The architecture, called Next, predicts paths using estimates of scene variables such as people, objects, and actions. Modules devoted to person behavior and person interaction create a feature tensor unique to each person. That tensor feeds into modules for trajectory generation, activity prediction, and activity location prediction.

The person behavior module detects people in a scene and embeds changes in their appearance and movement in the feature tensor.
The person interaction module describes the scene layout and relationships between every person-and-object pair in the feature tensor.
The trajectory generator uses an LSTM encoder/decoder to predict future locations of each person.
The activity prediction module computes likely actions for every person, at each location, in each time step. This auxiliary task mitigates errors caused by inaccurately predicted trajectories.
The activity location prediction module predicts a region, which is used to obtain a final action from the activity prediction module. A regression to the final position within that region forces the trajectory generator and activity prediction module to agree.
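
The region-plus-regression idea in the last module can be sketched as follows. This is a minimal illustration, not the authors’ code: it assumes the scene is divided into a regular grid, so that a final position is described by a grid cell (the region) plus an offset from that cell’s center (the regression target). The `Grid` class, its dimensions, and the helper names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Grid:
    """Hypothetical regular grid laid over a video frame (an assumption)."""
    width: float   # frame width in pixels
    height: float  # frame height in pixels
    cols: int      # number of grid columns
    rows: int      # number of grid rows

    def cell_of(self, x, y):
        # Map a pixel position to the (row, col) of the cell containing it.
        col = min(int(x / self.width * self.cols), self.cols - 1)
        row = min(int(y / self.height * self.rows), self.rows - 1)
        return row, col

    def center_of(self, row, col):
        # Pixel coordinates of the center of a cell.
        cell_w = self.width / self.cols
        cell_h = self.height / self.rows
        return (col + 0.5) * cell_w, (row + 0.5) * cell_h


def encode_target(grid, x, y):
    """Split a final position into a region (cell) and an offset from its center."""
    row, col = grid.cell_of(x, y)
    cx, cy = grid.center_of(row, col)
    return (row, col), (x - cx, y - cy)


def decode_prediction(grid, cell, offset):
    """Recover a final position from a predicted cell and a regressed offset."""
    cx, cy = grid.center_of(*cell)
    return cx + offset[0], cy + offset[1]


if __name__ == "__main__":
    grid = Grid(width=1920, height=1080, cols=32, rows=18)
    cell, offset = encode_target(grid, 100.0, 200.0)
    print(cell, offset, decode_prediction(grid, cell, offset))
```

In a setup like this, the cell would be learned as a classification target and the offset as a regression target, and the trajectory generator’s endpoint could be compared against the decoded position.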

Junwei Liang and his team at Carnegie Mellon, Google, and Stanford trained Next using a multi-task loss function that combines three terms: error in predicted trajectories, disagreement between predicted activity locations and trajectories, and error in activity classification. The loss is summed over all people in the scene.
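
A multi-task loss of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions, not the published formulation: it assumes squared-distance error for trajectories and final locations, cross-entropy for activity classification, and equal weights by default. The function names, dictionary keys, and weight parameters are all hypothetical.

```python
import math


def trajectory_loss(pred, true):
    # Sum of squared distances between predicted and true (x, y) at each step.
    return sum((px - tx) ** 2 + (py - ty) ** 2
               for (px, py), (tx, ty) in zip(pred, true))


def location_loss(pred_xy, true_xy):
    # Squared distance between predicted and true final positions.
    return (pred_xy[0] - true_xy[0]) ** 2 + (pred_xy[1] - true_xy[1]) ** 2


def activity_loss(probs, label):
    # Cross-entropy for one person's predicted activity distribution.
    return -math.log(max(probs[label], 1e-12))


def multi_task_loss(people, w_traj=1.0, w_loc=1.0, w_act=1.0):
    """Weighted sum of the three terms, accumulated over every person."""
    total = 0.0
    for p in people:
        total += w_traj * trajectory_loss(p["pred_traj"], p["true_traj"])
        total += w_loc * location_loss(p["pred_final"], p["true_final"])
        total += w_act * activity_loss(p["activity_probs"], p["activity_label"])
    return total


if __name__ == "__main__":
    person = {
        "pred_traj": [(0.0, 0.0), (1.0, 1.0)],
        "true_traj": [(0.0, 0.0), (1.0, 2.0)],
        "pred_final": (1.0, 1.0),
        "true_final": (1.0, 2.0),
        "activity_probs": [0.5, 0.5],
        "activity_label": 0,
    }
    print(multi_task_loss([person]))
```

Because the terms share one scalar objective, a gradient step that improves activity prediction also shapes the features used for trajectory prediction, which is one way an auxiliary task can help the main one.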

Why it matters: Beyond its superiority at predicting where people are heading, Next is the first published model that predicts both people’s trajectories and their activities. Prior models predicted actions over less than half the time horizon and were less accurate, while Next seems to approach a human’s predictive capability.

We’re thinking: The ability to anticipate human actions could lead to proactive robot assistants and life-saving safety systems. But it also has obvious — and potentially troubling — applications in surveillance and law enforcement. Pushing the boundaries of action prediction is bound to raise as many questions as it answers.
