How Vision Transformers See A new understanding of what's happening inside transformers

Reading time
2 min read
Features from ViT showing edge, texture, pattern, part, and object detection

While transformers have delivered state-of-the-art results in several domains of machine learning, few attempts have been made to probe their inner workings. Researchers offer a new approach.

What's new: Amin Ghiasi and colleagues at the University of Maryland visualized representations learned by a vision transformer. The authors compared their results to earlier visualizations of convolutional neural networks (CNNs).

Key insight: A method that has been used to visualize the internal workings of CNNs can also reveal what’s happening inside transformers: Feeding the network images that maximize the output of a particular neuron makes it possible to determine what individual neurons contribute to the network’s output. For instance, neurons in earlier layers may generate high outputs in response to an image with a certain texture, while neurons in later layers may generate high outputs in response to images of a particular object. Such results would suggest that earlier layers identify textures, and later layers combine those textures to represent objects.

How it works: The authors experimented with a pretrained ViT-B16 vision transformer.

  • They chose a neuron to visualize. Then they fed ViT-B16 an image of random noise. Using a loss function that maximized the neuron’s output, they backpropagated through the network to alter the image.
  • Separately, they fed every ImageNet image to ViT-B16 to find one that maximized the same neuron’s output. They compared the image they found with the generated image to identify commonalities.
  • They repeated this process for neurons in various parts of the network.
  • They also performed these steps with CLIP to gauge the behavior of neurons in a transformer that had been pretrained on both text and images.

Results: ViT-B16’s fully connected layers were most revealing: Neurons in fully connected layers yielded images that contained recognizable features, while those in attention layers yielded images that resembled noise.

  • Comparing visualizations associated with fully connected layers showed that, like CNNs, vision transformers learn representations that progress from edges and textures in early layers to parts of objects and entire objects in deeper layers.
  • Unlike CNNs, vision transformers make more use of an image’s background. (In a classification task, they outperformed CNNs when shown only an image’s background.) However, they’re not dependent on backgrounds (they also outperformed CNNs when shown only the foreground).
  • In their experiments with CLIP, the authors found neurons that generated high outputs in response to images that were dissimilar visually but related conceptually. For instance, a CLIP neuron was activated by pictures of a radio and a concert hall, as though it had learned the concept of music. ViT-B16 did not exhibit this behavior.

Why it matters: This work reveals that vision transformers base their output on hierarchical representations in much the same way that CNNs do, but they learn stronger associations between image foregrounds and backgrounds. Such insights deepen our understanding of vision transformers and can help practitioners explain their outputs.

We're thinking: The evidence that CLIP learns concepts is especially intriguing. As transformers show their utility in a wider variety of tasks, they’re looking smarter as well.


Subscribe to The Batch

Stay updated with weekly AI News and Insights delivered to your inbox