Looking at images, people see outlines before the details within them. A replacement for the traditional convolutional layer decomposes images based on this distinction between coarse and fine features.
What’s new: Researchers at Facebook AI, National University of Singapore, and Yitu Technology devised OctConv, a convolutional filter that reduces the computational cost and memory footprint of image processing networks without degrading performance.
Key insight: Yunpeng Chen and collaborators took their inspiration from signal processing: An audio signal can be represented as a set of discrete frequencies rather than a single waveform. Similarly, an image can be said to contain low-frequency information that doesn’t change much across space and high-frequency imagery that does. Low-frequency image features are shapes, while high-frequency image features comprise details such as textures. By capturing them separately, OctConv can reduce redundant information.
How it works: The outputs of a convolutional layer’s hidden units are feature maps that hold 2D spatial information. Feature maps often encode redundant information across an image’s color channels. OctConv cuts this redundancy by using a frequency-channel representation instead of the usual color-channel representation.
- In OctConv, each channel of a convolutional layer encodes either low- or high-frequency data. Low-frequency channels downsample the feature map, while high-frequency channels retain the feature map’s original resolution. A user-defined parameter controls the ratio of low- to high-frequency channels.
- Separate resolution filters share information between high- and low-frequency channels. Four filters account for all combinations of the channel inputs and outputs. While this arrangement may appear to require four times as many parameters as a standard convolutional layer, low-frequency channels have 50 percent resolution, resulting in fewer total parameters.
Results: A ResNet-152 with OctConv rather than CNN filters was 0.2 percent more accurate on ImageNet than the next best model, with 15 percent less computation during testing. An I3D model pair with OctConv filters was 2 percent more accurate on Kinetics-600, a video dataset for predicting human actions, with 10 percent less computation.
Why it matters: OctConv filters can substitute for standard convolutional filters for better performance, reduced computation, and smaller footprint. The authors suggest subdividing beyond their low- and high-frequency scheme. That would yield greater savings in size and training time, but its impact on performance is a subject for further experimentation.
Takeaway: Memory compression and pruning techniques have been important for deploying neural networks on smartphones and other low-powered, low-memory devices. OctConv is a fresh approach to shrinking image-processing networks that takes into account memory and computation primitives.