We’d all love to be able to find similar examples of our favorite cat videos. But few of us want to label thousands of similar videos of even the cutest kitties. New research makes headway in video classification when training examples are scarce.

What’s new: Jingwei Ji and Zhangjie Cao led Stanford researchers in developing Ordered Temporal Alignment Module (Otam), a model that classifies videos even with limited training data.

Key insight: ImageNet provides over a million training examples for image classification models, while the Kinetics video dataset offers an order of magnitude fewer. But each video comprises hundreds of individual frames, so video datasets typically contain more images than image datasets. Why not take advantage of all those examples by applying image recognition techniques to videos? That way, each frame, rather than each video as a whole, serves as a training example.

How it works: The task is to find the training video most similar to an input video and apply the same label. A convolutional neural network pre-trained on ImageNet extracts features for each input frame. Then the system compares the features and finds an alignment between the frames of a novel video and those of a training video. The CNN's weights are the only trainable parameters.

  • For each pair of frames from a novel video and a training video, Otam generates a similarity score. The pairs can be represented in a matrix whose rows are novel video frames and columns are training video frames. For example, (1,1) is the similarity between the first frames of the two videos, and (2,1) is the similarity between the second frame of the input video and the first frame of the training video.
  • Otam constructs a path through the similarity matrix by connecting frame pairs that are most similar. If an input video and a training video are identical, the path follows the diagonal.
  • The system aligns similar frames over time even if the videos differ in length. For instance, if two videos depict different people brewing tea, and one person moves more slowly than the other, Otam will match frames essential to the action and ignore the extra frames that represent the slow-moving brewer. The system calculates video-video similarity by summing frame-frame similarities along the path. In this way, the CNN learns to extract features that lead to similar paths for videos of the same class.
  • Ordering frame pairs by similarity can’t be optimized directly via backprop. The researchers formulated a continuous relaxation that weights every possible path by its similarity. (A continuous relaxation takes a nondifferentiable, discrete problem and approximates it with a continuous function that has better-behaved gradients, so backprop can optimize it. For instance, softmax is a continuous relaxation for the operation argmax.)
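
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' implementation: it assumes frame features are already extracted (the short vectors below are made-up stand-ins for CNN features), uses cosine similarity as the frame-frame score, and uses a standard dynamic-time-warping recurrence with a soft-min as one possible continuous relaxation. The function names and the `smoothing` parameter are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two nonzero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(novel_frames, training_frames):
    """Rows are novel-video frames, columns are training-video frames."""
    return [[cosine_similarity(q, s) for s in training_frames]
            for q in novel_frames]

def soft_min(values, smoothing):
    """Smooth stand-in for min(); approaches min() as smoothing goes to 0."""
    lowest = min(values)
    total = sum(math.exp(-(v - lowest) / smoothing) for v in values)
    return lowest - smoothing * math.log(total)

def alignment_cost(sim, smoothing=None):
    """Cumulative frame distance (1 - similarity) along an ordered path.

    smoothing=None follows the single best path (hard min).
    smoothing > 0 blends all paths via soft-min (assumed relaxation).
    """
    rows, cols = len(sim), len(sim[0])
    inf = float("inf")
    # cost[i][j]: cost of aligning the first i novel frames
    # with the first j training frames.
    cost = [[inf] * (cols + 1) for _ in range(rows + 1)]
    cost[0][0] = 0.0
    for i in range(1, rows + 1):
        for j in range(1, cols + 1):
            previous = [cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1]]
            if smoothing is None:
                best = min(previous)
            else:
                best = soft_min(previous, smoothing)
            cost[i][j] = (1.0 - sim[i - 1][j - 1]) + best
    return cost[rows][cols]

def classify(novel_frames, labeled_videos, smoothing=None):
    """Return the label of the training video with the lowest alignment cost."""
    def cost_to(item):
        return alignment_cost(similarity_matrix(novel_frames, item[1]), smoothing)
    return min(labeled_videos, key=cost_to)[0]

# Toy example: made-up 2-D "features" standing in for CNN outputs.
tea = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
slow_tea = [[1.0, 0.0], [1.0, 0.0], [0.7, 0.7], [0.0, 1.0], [0.0, 1.0]]
reversed_tea = [[0.0, 1.0], [0.7, 0.7], [1.0, 0.0]]
labeled = [("brew tea", tea), ("clean up", reversed_tea)]
predicted = classify(slow_tea, labeled, smoothing=0.1)
```

In this toy setup, the slower video repeats some frames, yet the ordered path can absorb the repeats at no extra cost, while the reversed video is penalized because its frames appear in the wrong order. Replacing `min` with `soft_min` is what makes the cost differentiable with respect to the similarities, so a framework with automatic differentiation could, in principle, propagate gradients back to the feature extractor.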

Results: On the Kinetics dataset (clips of people taking various actions, a few seconds each), Otam achieved one-shot accuracy of 73 percent, a big improvement over the previous state of the art, 68.4 percent. Otam similarly improved the state of the art on the Something-Something V2 dataset, which comprises clips of people interacting with everyday objects.

Why it matters: Some prior video classification systems also use pre-trained CNNs, but they include sequential layers that require lots of video data to train, since an entire video serves as a single training example. Otam eliminates much of that data hunger.

We’re thinking: Videos typically include a soundtrack. We hope the next iteration of Otam will compare sounds as well as images.

