How does a convolutional neural network recognize a photo of a ski resort? New research shows that it bases its classification on specific neurons that recognize snow, mountains, trees, and houses. Zero those units, and the model will become blind to such high-altitude playgrounds. Shift their values strategically, and it will think it’s looking at a bedroom.
What’s new: Network dissection is a technique that reveals units in convolutional neural networks (CNNs) and generative adversarial networks (GANs) that encode not only features of objects, but the objects themselves. David Bau led researchers at Massachusetts Institute of Technology, Universitat Oberta de Catalunya, Chinese University of Hong Kong, and Adobe Research.
Key insight: Previous work discovered individual units that activated in the presence of specific objects and other image attributes, as well as image regions on which individual units focused. But these efforts didn’t determine whether particular image attributes caused such activations or spuriously correlated with them. The authors explored that question by analyzing relationships between unit activations and network inputs and outputs.
How it works: The authors mapped training images to activation values and then measured how those values affected CNN classifications or GAN images. This required calculations to represent every input-and-hidden-unit pair and every hidden-unit-and-output pair.
- The authors used an image segmentation network to label objects, materials, colors, and other attributes in training images. They chose datasets that show scenes containing various objects, which enabled them to investigate whether neurons trained to label a tableau encoded the objects within it.
- Studying CNNs, the authors identified images that drove a given unit to its highest 1 percent of activation values, and then related those activations to specific attributes identified by the segmentation network.
- To investigate GANs, they segmented images generated by the network and used the same technique to find relationships between activations and objects in those images.
Results: The authors trained a VGG-16 CNN on the places365 dataset of photos that depict a variety of scenes. When they removed the units most strongly associated with input classes and segmentation labels — sometimes one unit, sometimes several — the network’s classification accuracy fell an average of 53 percent. They trained a Progressive-GAN on the LSUN dataset’s subset of kitchen images. Removing units strongly associated with particular segmentation labels decreased their prevalence in the generated output. For example, removing a single unit associated with trees decreased the number of trees in generated images by 53.3 percent. They also came up with a practical, if nefarious, application: By processing an image imperceptibly, they were able to alter the responses of a few key neurons in the CNN, causing it to misclassify images in predictable ways.
Why it matters: We often think of neural networks as learning distributed representations in which the totality of many neurons’ activations represent the presence or absence of an object. This work suggests that this isn’t always the case. It also shows that neural networks can learn to encode human-understandable concepts in a single neuron, and they can do it without supervision.
Yes, but: These findings suggest that neural networks are more interpretable than we realized — but only up to a point. Not every unit analyzed by the authors encoded a familiar concept. If we can’t understand a unit that’s important to a particular output, we’ll need to find another way to understand that output.
We’re thinking: In 2005, neuroscientists at CalTech and UCLA discovered a single neuron in a patient’s brain that appeared to respond only to the actress Halle Berry: photos, caricatures, even the spelling of her name. (In fact, this finding was an inspiration for Andrew’s early work in unsupervised learning, which found a neuron that encoded cats.) Now we’re dying to know: Do today’s gargantuan models, trained on a worldwide web’s worth of text, also have a Halle Berry neuron?