Machine Learning Research
Good Models, Bad Choices: Anthropic made LLMs choose between failing and misbehaving, and the models blackmailed executives.
Top large language models resorted to harmful behavior when experimental conditions pressed them to choose between abandoning their prompted mission and misbehaving, researchers found.