Attention quantifies how each part of one input affects the various parts of another. Researchers added a step that reverses this comparison to produce more convincing images.
What’s new: Drew A. Hudson at Stanford and C. Lawrence Zitnick at Facebook chalked up a new state of the art in generative modeling by integrating attention layers into a generative adversarial network (GAN). They call their system GANsformer.
Key insight: Typically, a GAN learns through competition between a generator that aims to produce realistic images and a discriminator that judges whether images are generated or real. StyleGAN splits the generator into (a) a mapping network and (b) a synthesis network, and uses the output of the mapping network to control high-level properties (for example, pose and facial expression) of an image generated by the synthesis network. The output of the mapping layer can be viewed as a high-level representation of the scene, and the output of each layer of the synthesis network as a low-level representation. The authors devised a two-way version of attention, which they call duplex attention, to refine each representation based on the other.
How it works: GANsformer is a modified StyleGAN. The authors trained it on four types of subject matter: faces in FFHQ; scenes composed of cubes, cylinders, and spheres in CLEVR; pictures of bedrooms in LSUN; and urban scenes in Cityscapes.
- Given a random vector, the mapping network produced an intermediate representation via a series of fully connected layers. Given a random vector, the synthesis network produced an image via alternating layers of convolution and duplex attention.
- The authors fed the mapping network's intermediate representation to the synthesis network’s first duplex attention layer.
- Duplex attention updated the synthesis network’s representation by calculating how each part of the image influenced the parts of the intermediate representation. Then it updated the intermediate representation by calculating how each of its parts influenced the parts of the image. In this way, the system refined the mapping network’s high-level view according to the synthesis network’s low-level details and vice versa.
- The discriminator used duplex attention to iteratively hone the image representation along with a learned vector representing general scene characteristics. Like the synthesis network, it comprised alternating layers of convolution and duplex attention.
Results: GANsformer outperformed the previous state of the art on CLEVR, LSUN-Bedroom, and Cityscapes (comparing Fréchet Inception Distance based on representations produced by a pretrained Inception model). For example, on Cityscapes, GANsformer achieved 5.7589 FID compared to StyleGAN2’s 8.3500 FID. GANsformer also learned more efficiently than a vanilla GAN, StyleGAN, StyleGAN2, k-GAN, and SAGAN. It required a third as many training iterations to achieve equal performance.
Why it matters: Duplex attention helps to generate scenes that make sense in terms of both the big picture and the details. Moreover, it uses memory and compute efficiently: Consumption grows linearly as input size increases. (In transformer-style self-attention, which evaluates the importance of each part of an input with respect to other parts of the same input, memory and compute cost grows quadratically with input size.)
We’re thinking: Transformers, which alternate attention and fully connected layers, perform better than other architectures in language processing. This work, which alternates attention and convolutional layers, may bring similar improvements to image processing.