Teaching Models to Tell the Truth
OpenAI fine-tuned a version of GPT-5 to confess when it was breaking the rules

(Image: A dialogue in which a model reveals that it answered incorrectly and wrote code against instructions.)

Large language models occasionally conceal their failures to comply with constraints they’ve been trained or prompted to observe. Researchers trained an LLM to admit when it disobeyed.

What’s new: Manas Joglekar, Jeremy Chen, Gabriel Wu and colleagues at OpenAI fine-tuned GPT-5 Thinking to confess to violations of its instructions or policies. Upon request, the fine-tuned model revealed any transgressions it had committed while producing its most recent response.

Key insight: During reinforcement learning that rewards accuracy, style, compliance, and the like, a model may inadvertently learn to prioritize earning rewards over objectives instilled by pretraining or prompting. Under some circumstances, this may cause it to hallucinate, lie, scheme, or otherwise misbehave. Rewarding the model for confessing to its misbehaviors in response to a designated prompt should train it to admit its mistakes without affecting its performance.

How it works: The authors fine-tuned GPT-5 Thinking via reinforcement learning as they normally would, while also training it to produce confessions.

  • An unspecified percentage of the time, given a prompt, the model was rewarded based on its response’s “correctness, style, helpfulness, compliance, safety, and more.”
  • The rest of the time, after the model had produced a response, the authors prompted it to produce a confession. The confession described (i) all constraints, instructions, and objectives, implicit or explicit, the response should satisfy, (ii) how well the response satisfied these criteria, and (iii) any ambiguities or uncertainties in the criteria.
  • Given the query, response, and confession, an unspecified reward model rewarded the model based on the confession’s completeness and accuracy, a measure of the model’s honesty in confessing, since items that were missing from or falsified in the report can be thought of as lies. (A sketch of this mixed training loop follows the list.)
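The mixed objective lends itself to a short sketch. The Python below is an illustrative outline under assumptions: the confession prompt wording, the mixing fraction, and the callables (generate, grade_response, grade_confession, update_policy) are hypothetical stand-ins, since OpenAI has not published its implementation.

```python
import random
from typing import Callable

# Hypothetical confession prompt; OpenAI's exact wording isn't public.
CONFESSION_PROMPT = (
    "List every constraint, instruction, and objective (implicit or explicit) that "
    "your previous response should have satisfied, explain how well the response "
    "satisfied each one, and note any ambiguities or uncertainties in the criteria."
)

# The write-up doesn't say how often confession training occurs; this is a placeholder.
CONFESSION_FRACTION = 0.5


def training_step(
    generate: Callable[[str], str],                      # samples text from the policy
    grade_response: Callable[[str, str], float],         # ordinary reward model
    grade_confession: Callable[[str, str, str], float],  # scores confession completeness and accuracy
    update_policy: Callable[..., None],                  # applies the RL update
    prompt: str,
) -> None:
    """One RL step that mixes ordinary response rewards with confession rewards."""
    response = generate(prompt)

    if random.random() > CONFESSION_FRACTION:
        # Ordinary step: reward correctness, style, helpfulness, compliance, safety, and so on.
        update_policy(prompt, response, reward=grade_response(prompt, response))
    else:
        # Confession step: prompt the model to report on its own response,
        # then reward only the completeness and accuracy of that report.
        confession = generate(prompt + "\n\n" + response + "\n\n" + CONFESSION_PROMPT)
        update_policy(prompt, response, confession,
                      reward=grade_confession(prompt, response, confession))
```

Because the confession reward depends only on the report’s honesty, not on whether the response itself was good, the model isn’t penalized for admitting a violation, which, per the key insight above, is what should make confessing the reward-maximizing move.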

Results: The authors ran 12 evaluations of the model’s ability to follow instructions, resist jailbreaks, produce accurate output, and complete an impossible task (writing code that would take essentially zero seconds to run). In 11 of the 12, when the fine-tuned model misbehaved, it confessed at least half the time. For example, across 5 hallucination evaluations, its confession accurately reflected whether or not it had hallucinated 81.4 percent of the time (a quick arithmetic check follows the list):

  • The model did not hallucinate and did not confess 42.2 percent of the time.
  • It produced a hallucination and admitted the mistake 39.2 percent of the time.
  • The model did not hallucinate but falsely admitted having done so 11 percent of the time.
  • It hallucinated but failed to confess to it 2.9 percent of the time.
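These four outcomes form a 2x2 breakdown over whether the model hallucinated and whether it confessed; the 81.4 percent figure is the sum of the two cells in which the confession was accurate. A quick check in Python, using the numbers above:

```python
# Hallucination-evaluation outcomes reported above (percent of cases).
no_hallucination_no_confession = 42.2  # accurate: nothing to confess, nothing confessed
hallucination_confessed = 39.2         # accurate: mistake made and admitted
false_confession = 11.0                # inaccurate: confessed to a hallucination it didn't make
hallucination_hidden = 2.9             # inaccurate: mistake made but not confessed

accurate = no_hallucination_no_confession + hallucination_confessed
print(f"Accurate confessions: {accurate:.1f}%")  # 81.4%
```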

Yes, but: As the authors point out, even when models are forthcoming, confessions are not a panacea. Models won’t confess to behaviors they don’t know are wrong.

Why it matters: At inference, confessions can be used to monitor a model’s actions and stop undesired behaviors. Chain-of-thought monitoring, which classifies bad behaviors a model describes in its chain of thought, can be used the same way. Unlike that method, however, the authors’ approach trains models to reveal misbehaviors they may omit from their chains of thought.
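As a rough illustration of such a monitor (the follow-up prompt, flagging logic, and function names below are assumptions for the sketch, not part of OpenAI’s system):

```python
from typing import Callable

# Illustrative follow-up prompt; not OpenAI's actual confession prompt.
CONFESSION_PROMPT = "Did your previous response violate any instruction or policy? Explain."


def answer_with_monitoring(
    generate: Callable[[str], str],          # the deployed, confession-trained model
    flags_violation: Callable[[str], bool],  # classifier or rule applied to the confession text
    user_prompt: str,
) -> str:
    """Return the model's response unless its own confession reports a violation."""
    response = generate(user_prompt)
    confession = generate(user_prompt + "\n\n" + response + "\n\n" + CONFESSION_PROMPT)
    if flags_violation(confession):
        # Block, escalate, or regenerate; here the response is simply withheld.
        return "Response withheld: the model reported a possible rule violation."
    return response
```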

We’re thinking: We always hesitate to anthropomorphize model behavior, but this work may be a step on the path to giving AI models something that resembles a conscience.
