Image Transformations Unmasked CNNs for vision that aren't fooled by changing backgrounds.

Reading time
2 min read
Equivariant subsampling on 1D feature maps with a scale factor c = 2.

If you change an image by moving its subject within the frame, a well trained convolutional neural network may not recognize the fundamental similarity between the two versions. New research aims to make CNN wise to such alterations.

What's new: Jin Xu and colleagues at DeepMind modified the input to particular CNN layers so translations and rotations of the input had the appropriate effect on the output.

Key insight: Given an image and a translated version of it, a model that’s robust to translation, for instance, should produce nearly identical representations, the only difference being that one is offset by the amount of the translation. Typical CNNs use alternating layers of convolution and downsampling, specifically pooling. They aren’t robust to such transformations because shifting the image changes the relative position of pixels within the pooling window, producing disparate representations. Maintaining relative pixel positions can preserve the representation despite translation, rotation, and reflection.

How it works: The authors trained a five-layer convolutional encoder/decoder to reconstruct a dataset of images of 2D shapes against plain backgrounds. In each training example, the shape was located at the upper left of the image and oriented at an angle between 0 and 90 degrees. The following steps describe how the network handled translation (it managed rotation and reflection in an analogous way):

  • A convolution layer generated a representation of an image.
  • Before each downsampling layer, the network found the position in the pooling window of the largest value in the representation. Then it shifted the representation by that integer. Subsequently it performed pooling normally and concatenated the size of the shift to the representation.
  • The encoder repeated the convolution-and-pooling operation five times, collecting the shift amounts into a list. Thus the encoded representation had two parts: the typical convolutional representation and a list of translation amounts at each pooling layer.
  • The decoder alternated the convolution and upsampling layers five times to reconstruct the original input. The upsampling layers took into account the amount of translation before the corresponding downsampling layers before increasing the size of the representation.

Results: In qualitative tests, the authors’ modified CNN reconstructed test images outside of the training distribution, such as shapes located at the right side of the image or rotated more than 90 degrees, more accurately than a baseline model that used normal pooling. It reconstructed 3,200 images from the grayscale Fashion-MNIST dataset of images of clothes and accessories with a mean reconstruction error of 0.0033, a decrease from the baseline architecture’s 0.0055.
Why it matters: The world is full of objects, placed willy-nilly. A CNN that can recognize items regardless of their orientation and position is likely to perform better on real-world images and other examples outside its training set.
We're thinking: This model would recognize a picture of Andrew if his head were shifted to one side. But would it recognize him if he were wearing something other than a blue shirt?


Subscribe to The Batch

Stay updated with weekly AI News and Insights delivered to your inbox