Who Robowatches the Robowatchmen?

Reading time
2 min read
LSFNet for the fusion of spatiotemporal descriptors

Do security cameras make your local bank or convenience store safer? These devices monitor countless locations around the clock, but it takes people watching to evaluate their output — an expensive, exhausting, and error-prone solution. Researchers at iCetana and the University of Western Australia took a step toward detecting critical events without humans in the loop.

What's new: Lei Wang, Du Q. Huynh, and Moussa Reda Mansour developed a lightweight architecture (one that can be trained on a MacBook Pro in half a day!) that differentiates between human motion and background movement like trees swaying, rain falling, and camera shaking in videos of outdoor scenes.

Key insight: Features belonging to the same class flock together in multi-dimensional vector space, yet typical methods of reducing the number of dimensions can lose this clustering information. The authors proposed a novel training procedure that shrinks dimensionality without interfering with the ability to group similar classes.

How it works: Loss Switching Fusion Network (LSFNet) fuses the handcrafted features known as dense trajectories, which are commonly used to detect people moving in videos, in a way that retains clustering information. Then it’s simple to distinguish videos that have human motion from those that show rain, trees waving, camera shaking, shifting illumination, and noisy video.

  • LSFNet consists of two parts: an autoencoder to fuse dense trajectories and a neural network  to classify different kinds of motion. During the first phase of training, the system alternates between optimizing loss functions corresponding to the autoencoder and the neural network.
  • In training phase 2, the authors pass the dense trajectories through the pretrained autoencoder to get a 128-dimensional feature vector. This feature vector is reduced in dimension and hashed into a 64 (or lower) dimensional representation.
  • During testing, they compare the lower-dimension feature of each test example with that of each training video. Then they find the likelihood that a given example falls into a particular class by soft voting on the K most similar training videos.

Results: Features processed by LSFNet produce visibly better classification compared to those produced by standard autoencoders or PCA. They outperform all state-of-the-art background- and foreground-motion classification techniques.

Why it matters: It’s hard to keep an eye on a half-dozen security-camera screens at once. AI that distinguishes between human activity and innocuous background motions could make video surveillance more effective in real time.

We're thinking: AI-equipped security systems likely will lead to legitimate concerns about privacy. But it may also mean more effective crime detection. AI companies can take the lead by addressing concerns proactively while working to maximize the benefits.


Subscribe to The Batch

Stay updated with weekly AI News and Insights delivered to your inbox