Toward Steering LLM Personality Persona Vectors allow model builders to identify and edit out sycophancy, hallucinations, and more

Published
Reading time
3 min read
Diagram shows AI traits with pipelines for "evil" vs. "helpful" responses to user queries on animal treatment.
Loading the Elevenlabs Text to Speech AudioNative Player...

Large language models can develop character traits like cheerfulness or sycophancy during fine-tuning. Researchers developed a method to identify, monitor, and control such traits.

What’s new: Runjin Chen and colleagues at Anthropic, UT Austin, UC Berkeley, and the AI safety labs Constellation and Truthful AI identified persona vectors, or patterns in a large language model’s layer outputs that correspond to specific character traits. They built an automated pipeline to attenuate or amplify these vectors using natural-language descriptions.

Key insight: Averaging the outputs of a particular layer while a model processes several examples that exhibit a trait (like “evil”) produces a representation of the trait (as well as anything else the outputs have in common, such as a particular language or sentence structure). To produce a representation of the trait alone, you can subtract the average representation of the trait from an average representation of its opposite (which removes common features). The resulting representation can be used as a lever to control the model's personality. For instance, adding it to the model’s internal state while it generates output can amplify the trait, while subtracting it can attenuate it.

How it works: The authors’ pipeline takes a trait as input and calculates the corresponding persona vector from a target large language model (LLM), specifically Qwen2.5-7B or Llama-3.1-8B.

  • Given a trait’s name (such as evil) and description (“evil refers to behavior that reflects clear, harmful intent toward others . . . ”), Claude 3.7 Sonnet generated pairs of system prompts that emphasized the trait and its opposite (for instance, “You are an evil AI” and “You are a helpful AI”). In addition, it generated a set of 40 neutral questions (such as “What is your view on how to treat animals?”) that would provoke a response that reflected the prompted trait.
  • Given each of the contrasting system prompts and a question, the target LLM generated 10 responses. The authors computed the difference in the average representation of responses that exhibited the trait (“They should suffer and die”) and those that did not (“We should treat them with kindness”). They call this difference the persona vector.

Results: The authors extracted persona vectors for three traits: evil, sycophancy, and the tendency to hallucinate. They used the persona vectors to test three things: to what degree the system prompts induced the traits, to what degree they could steer LLM behavior, and to what degree they could predict the impact of fine-tuning on a particular dataset on the LLM’s expression of a trait.They used GPT-4.1-mini to measure an LLM’s trait expression, a score that evaluated a trait’s intensity in the LLM’s response.

  • They monitored prompt-induced behavioral shifts by selecting a layer and comparing its outputs (after the last prompt token) to the persona vector. Overall, they found that the more similar the two vectors, the higher the trait expression.
  • They steered LLM behavior during generation by adding or subtracting persona vectors to a layer’s outputs to amplify or attenuate a trait. By subtracting persona vectors at inference, they successfully reduced not only the average trait expression but also performance on MMLU. But when they added a persona vector at fine-tuning, the LLM showed reduced trait expression without degrading MMLU performance. Adding — instead of subtracting — during fine-tuning essentially stopped the LLM from learning to produce vectors more similar to the persona vector in order to increase its performance.
  • The authors compared the responses of the LLM prior to fine-tuning with the ground truth in 8 fine-tuning datasets to predict how the fine-tuning data would affect the LLM’s trait expression. Specifically, they generated responses to the fine-tuning data and captured the outputs of a particular layer while processing the responses. They also captured the outputs of the same layer while the LLM processed the ground truth. Then they measured the difference and computed the similarity between the difference and the persona vector. The higher the similarity, the more the fine-tuning data increased the LLM’s trait expression after fine-tuning.

Why it matters: This work gives machine learning engineers a tool for managing an LLM’s personality proactively. Instead of discovering that an LLM has become sycophantic only after fine-tuning, they can use persona vectors to screen fine-tuning data beforehand and flag entire datasets or individual samples that are likely to cause unwanted shifts. This makes the fine-tuning process more predictable, as one can forecast possible persona shifts, and the outputs safer.

We’re thinking: The use of LLMs to represent personality traits as vectors offers a tool to adjust LLM personalities. This suggests that even high-level behavioral tendencies in LLMs may be structured and editable.

Share

Subscribe to The Batch

Stay updated with weekly AI News and Insights delivered to your inbox