Preserving Detail in Image Inputs

Better image compression for computer vision datasets

Published

Apr 22, 2020

Given real-world constraints on memory and processing time, images are often downsampled before they’re fed into a neural network. But the process removes fine details, and that degrades accuracy. A new technique squeezes images with less compromise.

What’s new: Researchers at the Alibaba DAMO Academy and Arizona State University led by Kai Xu reduce the memory needed for image processing by using a technique inspired by JPEG image compression.

Key insight: JPEG removes information the human eye won’t miss by describing patterns of pixels as frequencies. Successive color changes from pixel to pixel are higher frequencies, while monochromatic stretches are lower frequencies. By cutting frequencies that have little visual effect, the algorithm compresses images with minimal impact on image quality. The researchers employed a similar strategy to reduce input data without losing information critical to learning.
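
To make the frequency idea concrete, here is a minimal sketch that applies a one-dimensional discrete cosine transform (the transform family JPEG uses) to two rows of eight pixel values. The helper `dct_1d` and the sample rows are illustrative assumptions, not code from the researchers.

```python
import math

def dct_1d(signal):
    """Orthonormal DCT-II of a list of numbers (illustrative helper)."""
    n = len(signal)
    coeffs = []
    for k in range(n):
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        coeffs.append(scale * total)
    return coeffs

flat = [100.0] * 8              # a monochromatic stretch of pixels
alternating = [0.0, 200.0] * 4  # the value changes at every pixel

flat_coeffs = dct_1d(flat)
alt_coeffs = dct_1d(alternating)

# Index 0 is the average level; higher indices are higher frequencies.
flat_is_low_only = all(abs(c) < 1e-9 for c in flat_coeffs[1:])
alt_peak = max(range(1, 8), key=lambda k: abs(alt_coeffs[k]))

print("flat row has only the lowest frequency:", flat_is_low_only)
print("alternating row peaks at frequency index:", alt_peak)
```

The flat row puts all of its energy in the lowest coefficient, while the alternating row peaks at the highest frequency index, 7. Dropping coefficients that carry little energy or little visible effect is what lets a frequency representation shrink without much damage.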

How it works: The researchers transformed images into the frequency domain, selected frequencies to remove, and fed the reduced frequency representation into ResNet-50 and MobileNet V2 models.

The algorithm starts by converting RGB images to YCbCr format, which represents each pixel as a brightness value (Y) plus two color components, blue-difference (Cb) and red-difference (Cr). Humans are especially sensitive to brightness, making this format good for data reduction.
It transforms the YCbCr image into a frequency representation of the same size. Then it groups similar frequencies into channels (which no longer capture brightness and color). The grouping increases the number of channels by a fixed amount but reduces the height and width of the input to a neural network-friendly size.
The researchers propose two methods to decide which frequencies to discard. In one, a separate model learns to turn each channel on or off based on how it affects classification performance. In the other, they use rules based on observation; for example, lower frequency channels tend to capture more useful information.
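
The steps above can be sketched roughly as follows. This is a simplified illustration, not the researchers' implementation: it assumes 8×8 blocks and the color conversion constants used by JPEG, and the rule in `keep_low_frequencies` (keep a channel when its two frequency indices sum to at most 2) is a hypothetical stand-in for the observation-based rules. The learned on/off gating is not shown.

```python
import math

BLOCK = 8  # assumed block size, borrowed from JPEG

def rgb_to_ycbcr(r, g, b):
    """JPEG-style (full-range) RGB to YCbCr conversion of one pixel."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr

def dct_2d(block):
    """Orthonormal 2-D DCT-II of a square block given as a list of lists."""
    n = len(block)
    def scale(k):
        return math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            total = 0.0
            for i in range(n):
                for j in range(n):
                    total += (block[i][j]
                              * math.cos(math.pi * (i + 0.5) * u / n)
                              * math.cos(math.pi * (j + 0.5) * v / n))
            out[u][v] = scale(u) * scale(v) * total
    return out

def to_frequency_channels(image):
    """Turn an H x W grid of (r, g, b) tuples into a dict that maps
    (component, u, v) to an (H/8) x (W/8) grid of DCT coefficients."""
    rows, cols = len(image) // BLOCK, len(image[0]) // BLOCK
    ycbcr = [[rgb_to_ycbcr(*px) for px in row] for row in image]
    channels = {(c, u, v): [[0.0] * cols for _ in range(rows)]
                for c in range(3) for u in range(BLOCK) for v in range(BLOCK)}
    for c in range(3):
        for by in range(rows):
            for bx in range(cols):
                block = [[ycbcr[by * BLOCK + i][bx * BLOCK + j][c]
                          for j in range(BLOCK)] for i in range(BLOCK)]
                coeffs = dct_2d(block)
                # The same frequency from every block lands in one channel.
                for u in range(BLOCK):
                    for v in range(BLOCK):
                        channels[(c, u, v)][by][bx] = coeffs[u][v]
    return channels

def keep_low_frequencies(channels, max_index_sum=2):
    """Hypothetical static rule: keep channels whose u + v is small."""
    return {key: grid for key, grid in channels.items()
            if key[1] + key[2] <= max_index_sum}

# Synthetic 16 x 16 image with a horizontal color gradient.
image = [[(x * 16, 128, 255 - x * 16) for x in range(16)] for _ in range(16)]
channels = to_frequency_channels(image)
kept = keep_low_frequencies(channels)

grid = channels[(0, 0, 0)]
print(len(channels), "channels, each", len(grid), "x", len(grid[0]))
print(len(kept), "channels kept")
```

On the 16×16 test image, the 3 color components become 192 channels of size 2×2, and the static rule keeps 18 of them. Height and width shrink by a factor of 8 while the channel count grows by a factor of 64, which matches the trade described above: more channels, smaller spatial size, and a pruning step that decides how much data actually reaches the network.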

Results: A ResNet-50 trained on ImageNet in the usual way achieves 76 percent top-1 accuracy. Slimming the input in the frequency domain increased accuracy by 1.38 percent. A MobileNet V2 trained on ImageNet and a ResNet-50 feature pyramid network trained on COCO saw similar improvements.

Why it matters: Many images are much larger than the input size of most convolutional neural networks, which makes downsampling a necessary evil. Rescaling the frequency representation of images preserves relevant information, so downsampling doesn’t need to hurt performance.

We’re thinking: Smartphones capture images in 4K, but many CNNs take inputs of only 224×224 pixels. It’s nice to know the missing resolution isn’t going entirely to waste.
