Ordinary LLMs Implicitly Take Reasoning Steps

Anthropic experiment finds Claude shows signs of unprompted reasoning

Diagram comparing original transformer model with a replacement model using token-level attention and neuron-level outputs.

Even without explicit training in reasoning, large language models “think” in ways that may be more deliberate than previously understood.

What’s new: Emmanuel Ameisen and colleagues at Anthropic devised a method to study how transformers generate responses to specific prompts. Applying it to Claude 3.5 Haiku, they found that the model, which is not trained to generate chains of thought, nonetheless appeared to take reasoning steps, as reflected in its internal activations.

Key insight: A viable alternative to a fully connected layer is a cross-layer transcoder, which has two layers. The outputs of its larger first layer are sparse, which makes them interpretable as “features”: individual values that correspond to concepts. By mapping an input to the features it activates most strongly, we can identify the concepts that determine the model’s output.
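
To make the idea concrete, here is a minimal sketch in PyTorch of a transcoder that stands in for one fully connected block. All names and sizes are hypothetical, and it shows a single-layer version rather than Anthropic’s cross-layer variant, which writes to several downstream layers: an encoder maps the block’s input to a wide vector of feature activations, a ReLU plus a sparsity penalty keeps most of them at zero, and a decoder reconstructs the block’s original output.

```python
import torch
import torch.nn as nn

class Transcoder(nn.Module):
    """Stand-in for one fully connected (MLP) block.

    The first layer maps the block's input to a much wider vector of
    feature activations; a ReLU plus a sparsity penalty during training
    keeps most of them at zero, so each active feature can be inspected
    on its own. The second layer maps the sparse features back to the
    block's original output. Sizes are hypothetical.
    """

    def __init__(self, d_model=512, n_features=16384):
        super().__init__()
        self.encode = nn.Linear(d_model, n_features)  # larger first layer
        self.decode = nn.Linear(n_features, d_model)  # back to the block's output size

    def forward(self, x):
        features = torch.relu(self.encode(x))  # sparse, interpretable activations
        return self.decode(features), features

def transcoder_loss(pred, target, features, sparsity_coeff=1e-3):
    """Match the original block's output while keeping features sparse."""
    reconstruction = ((pred - target) ** 2).mean()
    sparsity = features.abs().sum(dim=-1).mean()  # L1 proxy for counting non-zero values
    return reconstruction + sparsity_coeff * sparsity
```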

How it works: The team replaced fully connected layers in Claude 3.5 Haiku with cross-layer transcoders and interpreted their features.

  • The authors trained one cross-layer transcoder for each fully connected layer. Given the fully connected layer’s input, the cross-layer transcoder learned to minimize the difference between its own output and the fully connected layer’s output. It also learned to minimize the number of non-zero values in its first layer’s output, that is, its feature activations (roughly the objective in the sketch above).
  • To interpret a transcoder’s features, they substituted it for the corresponding fully connected layer and ran selected inputs through the model. They produced visualizations of inputs that caused a feature to take a high value and looked for commonalities among those inputs. In this way, they found that certain features were associated with specific words (like “rabbit”), concepts (like large or capital city), and next-word predictions, such as “say D_” (indicating that the predicted token should start with the letter D) or “say capital” (indicating that the predicted token should be a capital city).
  • For each of several prompts, such as “The opposite of small is,” they simplified Claude 3.5 Haiku to examine its response. They replaced the fully connected layers with cross-layer transcoders and fixed the attention computation to the pattern it produced for that prompt. The simplified replacement model was essentially a fully connected neural network.
  • They built a graph that traced how the replacement model produced its output. The nodes were features, and the edges represented a high contribution of one feature to another feature in a later layer. Then they replaced the features with their corresponding interpretations. For instance, if the input prompt was “The opposite of small is,” the graph connected the feature “opposite” to the feature “antonym,” and it connected the features “antonym” and “small” to the output feature “say large.”
  • They verified causal relationships between inputs, interpretations, and outputs by replacing specific layer outputs with outputs that corresponded to a different interpretation (see the sketch after this list). For instance, they replaced the values that represented “antonym” with values that represented “synonym.” After this intervention, prompted with “The opposite of small is,” the model generated the synonym “little” instead of the antonym “large.”
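
This kind of intervention can be pictured as editing a transcoder’s sparse activation vector before the rest of the forward pass runs. The sketch below is illustrative only: the function and feature indices are hypothetical, and Anthropic’s actual interventions edit activations inside the running model rather than a standalone vector.

```python
import torch

def swap_feature(features, from_idx, to_idx):
    """Replace one interpreted feature's activation with another's.

    `features` is a transcoder's sparse activation vector for one token.
    Zero out the feature interpreted as, say, "antonym" (from_idx) and
    activate the one interpreted as "synonym" (to_idx) with the same
    strength, then let the rest of the forward pass run on the edited
    vector. Indices are hypothetical.
    """
    edited = features.clone()
    strength = edited[..., from_idx].clone()
    edited[..., from_idx] = 0.0
    edited[..., to_idx] = strength
    return edited
```

Roughly this kind of edit, applied inside the running model, is what shifted the completion from “large” to “little” in the example above.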

Results: The authors built graphs that show how Claude 3.5 Haiku computes its output for a number of selected prompts.

  • A graph for the prompt, “Fact: the capital of the state containing Dallas is” showed that the model determined internally that Dallas is in Texas, and then predicted Austin from the ideas “say a capital” and “Texas.” In other words, the model took steps rather than predicting “Austin” directly. To verify this conclusion, the authors replaced the features for “Texas” with the features for “California.” The model generated “Sacramento.”
  • Given a prompt that mentioned several symptoms of an illness and asked which one best clarified a potential diagnosis, the model took into account the various symptoms, produced potential diagnoses internally, considered various diagnostic criteria, and decided which one to output.
  • The authors’ graphs also revealed how the model, prompted to describe its chain of thought, sometimes produced misleading output. Given a simple math problem and asked for the solution and the steps taken to find it, the model computed the answer correctly, and the graph and chain of thought matched. But given a more complex problem along with the expected solution and a request to double-check it, the model’s chain of thought rationalized an incorrect solution, while the graph showed that the model had worked backward from the supplied solution rather than trying to solve the problem. Given the same problem without the expected solution, the chain of thought described using a calculator, while the graph showed that the model had simply guessed an incorrect solution.

Behind the news: Last year, Google trained sparse autoencoders to surface individual features in Gemma 2. Before that, Anthropic used similar methods to interpret Claude 3 Sonnet’s middle layer.

Why it matters: Apparently Claude 3.5 Haiku (and presumably other large language models) spontaneously performs implicit reasoning steps without being prompted to do so. Anthropic’s method reveals not only whether a model reasons or takes a shortcut, but also what it truly does well and what it only professes to do well.

We’re thinking: The authors’ approach to examining how large language models generate output is interesting. We wonder whether even pre-transformer vanilla neural networks would appear to perform some sort of “reasoning” if we were to interpret them in a similar way.
