Google Brain researcher Been Kim envisions a scientific approach to interpretability

Published: Dec 28, 2022

It’s an exciting time for AI, with fascinating advances in generated media and many other applications, some even in science and medicine. Some folks may dream about what more AI can create and how much bigger models we may engineer. While those directions are exciting, I argue that we need to pursue much less flashy work: going back to the basics and studying AI models as targets of scientific inquiry.

Why and how? The field of interpretability aims to create tools that generate “explanations” for the output of complex models. This field has emerged naturally out of a need to build machines that we can have a dialogue with: What is your decision? Why did you decide that? For example, a tool takes an image and a classification model and generates an explanation in the form of weighted pixels. The higher a pixel’s weight, the more important it is. One tool might call a pixel important to the degree that changing its value changes the output, but the definition of importance differs from tool to tool.
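
To make the idea of weighted pixels concrete, here is a minimal sketch, not any particular published tool. It assumes a hypothetical toy classifier (`toy_model`, a logistic score over pixels) and two illustrative definitions of importance: the drop in output when a pixel is replaced by a baseline value, and the gradient of the output with respect to the pixel. All names, shapes, and the random data are assumptions made for illustration.

```python
import numpy as np

def toy_model(image, weights):
    """Hypothetical classifier: a logistic score over the pixels."""
    return 1.0 / (1.0 + np.exp(-np.sum(image * weights)))

def occlusion_saliency(image, weights, baseline=0.0):
    """Importance = drop in output when one pixel is replaced by the baseline."""
    reference = toy_model(image, weights)
    saliency = np.zeros_like(image, dtype=float)
    for idx in np.ndindex(image.shape):
        occluded = image.copy()
        occluded[idx] = baseline
        saliency[idx] = reference - toy_model(occluded, weights)
    return saliency

def gradient_saliency(image, weights):
    """Importance = derivative of the output with respect to each pixel."""
    p = toy_model(image, weights)
    return p * (1.0 - p) * weights

rng = np.random.default_rng(0)
image = rng.random((4, 4))          # made-up 4x4 "image"
weights = rng.normal(size=(4, 4))   # made-up model parameters

occ = occlusion_saliency(image, weights)
grad = gradient_saliency(image, weights)
print(np.round(occ, 3))
print(np.round(grad, 3))
```

Both maps assign one weight per pixel, yet they answer slightly different questions, which is one way the definition of importance can differ between tools.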

While there have been some successes, many tools have turned out to behave in ways we did not expect. For example, it has been shown that explanations of an untrained model are quantitatively and qualitatively indistinguishable from those of a trained model. (Then what does the explanation explain?) Explanations often change with small changes in the input even when the output stays the same. In addition, there is often little causal relationship between a model’s output (what we are trying to explain) and the tool’s explanation. Other work shows that good explanations of a model’s output don’t necessarily have a positive influence on how people use the model.
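
One way such a check could be run is sketched below, under loud assumptions: the two “tools” are hypothetical stand-ins, the “trained” and “untrained” weights are just two independent random draws, and similarity is measured with a simple rank correlation. The point is only to show what it looks like when an explanation does not depend on the model it claims to explain.

```python
import numpy as np

def rank_correlation(a, b):
    """Spearman-style rank correlation between two maps (assumes no ties)."""
    ranks_a = np.argsort(np.argsort(a.ravel()))
    ranks_b = np.argsort(np.argsort(b.ravel()))
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])

def gradient_tool(image, weights):
    """For a linear score sum(image * weights), the pixel gradient is the weights."""
    return np.abs(weights)

def contrast_tool(image, weights):
    """Hypothetical tool that only looks at the image and ignores the model."""
    return np.abs(image - image.mean())

rng = np.random.default_rng(0)
image = rng.random((8, 8))
trained_weights = rng.normal(size=(8, 8))    # stand-in for learned parameters
untrained_weights = rng.normal(size=(8, 8))  # stand-in for a random initialization

similarities = {}
for tool in (gradient_tool, contrast_tool):
    similarities[tool.__name__] = rank_correlation(
        tool(image, trained_weights), tool(image, untrained_weights)
    )
    print(f"{tool.__name__}: {similarities[tool.__name__]:.2f}")
```

Here `contrast_tool` produces identical maps for both sets of weights by construction, so swapping the model changes nothing. A tool that behaves this way on a real network may be describing the input rather than the model.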

What does this mismatch between expectation and outcome mean, and what should we do about it? It suggests that we need to examine how we build these tools.

Currently we take an engineering-centric approach: trial and error. We build tools based on intuition (for instance, the intuition that explanations would be easier for humans to read if we generated one weight per chunk of pixels instead of one per pixel). While the engineering-centric approach is useful, we also need fundamental principles (what can be called science) to build better tools.
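
The chunk-of-pixels intuition can be sketched as patch-level occlusion. This is an illustrative assumption about how such a tool might work, not a description of a specific method: the model is a hypothetical linear score, the chunks are fixed square patches, and the patch size and baseline value are arbitrary choices.

```python
import numpy as np

def linear_score(image, weights):
    """Hypothetical model: a linear score over the pixels."""
    return float(np.sum(image * weights))

def patch_occlusion(image, weights, patch=2, baseline=0.0):
    """Importance of each patch = drop in score when the whole patch is blanked."""
    height, width = image.shape
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    reference = linear_score(image, weights)
    importance = np.zeros((height // patch, width // patch))
    for row in range(0, height, patch):
        for col in range(0, width, patch):
            occluded = image.copy()
            occluded[row:row + patch, col:col + patch] = baseline
            importance[row // patch, col // patch] = (
                reference - linear_score(occluded, weights)
            )
    return importance

rng = np.random.default_rng(0)
image = rng.random((4, 4))          # made-up 4x4 "image"
weights = rng.normal(size=(4, 4))   # made-up model parameters

patch_map = patch_occlusion(image, weights)
print(np.round(patch_map, 3))
```

The result is a coarser map with one weight per patch. Whether that coarser map is more faithful to the model, and not just easier to read, is exactly the kind of question intuition alone cannot settle.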

In developing drugs, for instance, trial and error is essential (say, testing a new medicine through rigorous clinical trials before deploying it), but it goes hand in hand with sciences like biology and genetics. While science has many gaps in understanding how the human body works, it provides fundamental principles for creating the tool (in this case, drugs). In other words, pursuing both science and engineering simultaneously, such that each can inform the other, has proven to be a successful way to work with complex beings (humans).

The field of machine learning needs to study our complex aliens (models) as other disciplines study humans. How would such study of these aliens help interpretability? Here’s an example. A team at the University of Tübingen found that neural networks rely on texture (say, an elephant’s skin) more than shape (an elephant’s outline). Even if we see an elephant’s contour in the explanation of an image, perhaps in the form of collectively highlighted pixels, the study informs us that the model may not be seeing the shape but rather the texture. This is called inductive bias: a tendency of a particular class of models due to either its architecture or the way we optimize it. Revealing such tendencies can help us understand this alien, just as revealing a human’s tendency (bias) can be used to understand human behavior (such as unfair decisions).

In this way, the methods often used to understand humans can also help us understand AI models. These include observational studies (say, observing a multi-agent system from afar to infer emerging behaviors), controlled studies (for instance, intervening in a multi-agent system to elicit underlying behaviors), and surgery (such as examining the internals of the superhuman chess player AlphaZero). For AI models, thanks to the way their internals are built (they are made of math!), we have one more tool: theoretical analysis. Work along this direction has already yielded exciting theoretical results on the behaviors of models, optimizers, and loss functions. Some of it takes advantage of classical tools in statistics, physics, dynamical systems, or signal processing. Many tools from different fields are yet to be explored in the study of AI.

Pursuing science doesn’t mean we should stop engineering. The two go hand in hand: Science will enable us to build tools grounded in principles and knowledge, while engineering enables science to become practical. Engineering can also inspire science: What works well in practice can provide hints about the structures of models that we wish to formalize in science, just as the high performance of convolutional networks in 2012 inspired many theory papers that tried to analyze why convolutions help generalization.

I’m excited to enter 2023 and many other years to come as we advance our understanding of our aliens and invent ways to communicate with them. By enabling a dialogue, we will enable richer collaborations and better leverage the complementary skill sets of humans and machines.

Been Kim is a research scientist at Google Brain. Her work on helping humans to communicate with complex machine learning models won the UNESCO Netexplo award.
