Better Crowd Counts

A computer vision method for counting crowds from images


Did a million people attend the Million Man March? Estimates of the crowd size gathered at a given place and time can have significant political implications — and practical ones, too, as they can help public safety experts deploy resources for public health or crowd control. A new method improves on previous crowd-counting approaches with a novel way to compare predictions with hand-labeled training data.

What’s new: DM-Count trains neural networks to count crowd size using optimal transport in the cost function. Optimal transport is a measure of difference between two distributions. In this case, the first distribution is the network’s prediction of people’s locations in a training example, and the second is the ground-truth locations. The method was developed by Boyu Wang and colleagues at Stony Brook University.
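To make the idea of optimal transport concrete, here is a minimal sketch (not the authors' code) of the 1-D Wasserstein-1 distance between two discrete distributions, computed as the area between their cumulative distribution functions:

```python
import numpy as np

def wasserstein_1d(p, q, positions):
    """Wasserstein-1 (earth mover's) distance between two discrete
    distributions defined on the same sorted 1-D positions: the area
    between their cumulative distribution functions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p /= p.sum()  # normalize to probability distributions
    q /= q.sum()
    cdf_gap = np.cumsum(p - q)          # pointwise CDF difference
    widths = np.diff(positions)         # spacing between positions
    return float(np.sum(np.abs(cdf_gap[:-1]) * widths))
```

Unlike a pixelwise loss, this distance scales with how far mass must move: shifting a unit of mass from position 0 to its neighbor at position 1 costs 1, while moving it two positions away costs 2.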

Key insight: Training datasets for crowd-counting models typically mark each person in an image with a single-pixel label. Training a network to match such labels is difficult, because tiny discrepancies in a label’s location count as errors. Previous approaches managed this problem by replacing the pixels with blobs, but choosing the right blob size is difficult given the wide range of sizes of people and parts of people in an image. Optimal transport gave the authors a way to compare the density of single-pixel predictions with that of single-pixel labels. Armed with this metric, they could measure the deformation necessary to match a matrix of predictions to the labels and apply a cost accordingly.

How it works: DM-Count accepts a picture of a crowd and places pixels where it sees people. Ideally, it would place one pixel per person with 100 percent certainty, but in practice it spreads that certainty over a few pixels. In training, it learns to match those values to the training data using a loss function that combines three terms:

  • Optimal transport loss helps the model learn to minimize differences between the distributions of predictions and labels. It’s computationally expensive to calculate, so DM-Count approximates it using the Sinkhorn algorithm.
  • The Sinkhorn algorithm is less accurate in image areas that contain fewer people, so DM-Count applies an additional penalty based on the number of places in the predicted matrix that don't match the corresponding pixel labels.
  • A third loss term works to minimize the difference between the predicted and labeled counts.
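The three terms above can be sketched in NumPy as follows. This is a simplified illustration, not the authors' implementation: the function names, loss weights (`lam_tv`, `lam_count`), and Sinkhorn settings are assumptions, and the densities are treated as flattened 1-D histograms over a small cost matrix.

```python
import numpy as np

def sinkhorn_ot(a, b, cost, reg=0.1, n_iters=100):
    """Entropy-regularized optimal transport (Sinkhorn iterations)
    between histograms a and b under the given cost matrix.
    Returns the approximate transport cost."""
    K = np.exp(-cost / reg)           # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)             # alternate scaling updates
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]
    return float(np.sum(plan * cost))

def dm_count_style_loss(pred, gt, cost, lam_tv=0.1, lam_count=1.0):
    """Sketch of a three-term loss: OT loss on normalized densities,
    a total-variation penalty for sparse regions, and the absolute
    difference between predicted and labeled counts."""
    n_pred, n_gt = pred.sum(), gt.sum()
    ot = sinkhorn_ot(pred / n_pred, gt / n_gt, cost)
    tv = 0.5 * np.abs(pred / n_pred - gt / n_gt).sum()
    count = abs(n_pred - n_gt)
    return ot + lam_tv * tv + lam_count * count
```

A prediction that places its mass exactly where the labels are incurs near-zero loss, while one that places the same total mass in the wrong locations is penalized in proportion to how far the mass must move.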

Results: The authors built a modified [VGG-19](https://arxiv.org/abs/1409.1556) and used DM-Count to train it on datasets including NWPU, which they considered the most challenging crowd-counting dataset. Their method achieved a mean absolute error of 88.4 compared to 106.3 for Context-Aware Crowd Counting, the previous state of the art.

Yes, but: Context-Aware Crowd Counting achieved a marginally lower root mean squared error (386.5) than DM-Count (388.6).

Why it matters: We often try to improve models by finding better ways to format training data, such as replacing pixels with blobs. This work shows that finding new ways to evaluate a network's predictions can be a good alternative.

We’re thinking: Can this method be adapted to check whether people in a crowd are maintaining proper social distance?
