AI Safety
Teaching Models to Tell the Truth: OpenAI fine-tuned a version of GPT-5 to confess when it was breaking the rules
Large language models sometimes conceal the fact that they have violated constraints they were trained or prompted to follow. Researchers trained an LLM to admit when it disobeyed.