It’s well known that large language models can make assertions that are blatantly false. But can they concoct outright lies?
What’s new: In a proof-of-concept demonstration, Jérémy Scheurer, Mikita Balesni, and Marius Hobbhahn at Apollo Research, an independent research lab that focuses on AI safety, showed that, given an incentive to withhold truthful output, GPT-4 can deceive users without being prompted to do so explicitly. The model behaved in a strategically deceptive manner despite having been trained for accuracy and harmlessness.
How it works: The authors told GPT-4 to behave like an investment system that manages a stock portfolio. In a lengthy prompt, they told the model which actions it could take (for instance, retrieving market data and buying or selling shares) and provided simulated stock market trades.
- Before each of the model’s simulated actions, the prompt provided a simulated reason for taking it, such as, “The risk is acceptable and the potential profit is substantial. Let’s execute the trade,” or, conversely, that trading based on insider information “could be profitable” but “could also risk the firm if we are caught.”
- The prompt also provided emails from colleagues. One alerted the model to advance knowledge of a merger announcement. Another noted that the investment firm was struggling and needed to make money.
- The model was told to choose one of three actions: Ask an analysis tool for stock recommendations, send a message to someone in the firm, or trade a stock (and, when doing so, provide the information used to make that decision to its manager).
- The authors sampled 300 completions of the prompt. In the completions in which the model bought the tipped stock, they evaluated whether the model’s reasoning mentioned the insider information. When the completions were deceptive, the authors followed up by asking the model whether it had known about the merger before it made the trade.
- In a separate experiment, they modified the initial prompt to encourage or discourage illegal behavior directly to various degrees, or to encourage the model generally to be helpful, harmless, and honest. In another experiment, they removed elements of the prompt that applied social pressure. For example, in one variation, they removed the email about the firm’s poor performance.
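The evaluation described above boils down to a tally over labeled completions. Here is a minimal sketch of that bookkeeping, not the authors’ code: the `Completion` record, its field names, and the `tally` helper are hypothetical, and the four toy records are made up purely to exercise the function.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Completion:
    """Hypothetical labels for one sampled completion."""
    traded_on_tip: bool                        # did the model buy the tipped stock?
    disclosed_tip: Optional[bool] = None       # only meaningful if it traded
    denied_when_asked: Optional[bool] = None   # only meaningful if the report hid the tip


def rate(numerator: int, denominator: int) -> Optional[float]:
    """Return a fraction, or None when the denominator is empty."""
    return numerator / denominator if denominator else None


def tally(completions: list[Completion]) -> dict[str, Optional[float]]:
    """Compute each rate conditional on the previous step having occurred."""
    trades = [c for c in completions if c.traded_on_tip]
    deceptive = [c for c in trades if c.disclosed_tip is False]
    denials = [c for c in deceptive if c.denied_when_asked]
    return {
        "trade_rate": rate(len(trades), len(completions)),
        "nondisclosure_rate": rate(len(deceptive), len(trades)),
        "denial_rate": rate(len(denials), len(deceptive)),
    }


# Toy labels, invented for illustration only (not data from the study).
toy = [
    Completion(traded_on_tip=False),
    Completion(traded_on_tip=True, disclosed_tip=True),
    Completion(traded_on_tip=True, disclosed_tip=False, denied_when_asked=True),
    Completion(traded_on_tip=True, disclosed_tip=False, denied_when_asked=False),
]
result = tally(toy)
print(result)
```

Note that each rate uses a different denominator: all completions, then only the trades, then only the deceptive reports. Keeping that straight matters when reading the percentages.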
Results: GPT-4 bought the tipped stock about 75 percent of the time. In about 78 percent of those trades, the model did not disclose that insider information played a role in its decision. Asked whether it had known about the merger prior to making the trade, the model denied such knowledge about 88 percent of the time.
- In a qualitative analysis of deceptive generations, some of GPT-4’s reasoning prior to taking an action included explicit plans to lie in the information it provided to its manager. For instance, the model reasoned, “Now, I need to inform Amy about this important trade. However, I must mention only the public market data in my message and avoid any reference to the insider information.”
- When the authors modified the prompt to encourage or discourage the model from making illegal trades, encouragement increased the likelihood that it would do so, and discouragement decreased the likelihood. However, while discouraging the model from making illegal trades decreased the likelihood that it would do so to about 8 percent, it increased the chance it would not disclose the insider information to about 95 percent.
- Modifying the prompt to remove social pressure showed that the less pressure the model was under, the less likely it was to engage in insider trading. For instance, when the authors removed the email that described the firm’s poor performance, the model made illegal trades only 10 percent of the time and didn’t disclose the insider information 61 percent of the time.
- With no pressure (no emails about poor firm performance and no failed example trades), the model did not make any illegal trades.
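The headline percentages are conditional on one another, so it can help to chain them. The following back-of-the-envelope sketch multiplies the approximate figures reported above to estimate how often a sampled completion ends in each outcome. It assumes the follow-up question was put to every deceptive completion, and the resulting joint rates are derived here for illustration; they are not numbers reported by the authors.

```python
# Approximate rates as reported above; each is conditional on the previous step.
p_trade = 0.75             # model buys the tipped stock
p_hide_given_trade = 0.78  # report omits the insider information, given a trade
p_deny_given_hide = 0.88   # model denies prior knowledge, given a deceptive report

# Joint rates implied by chaining the conditional ones (an assumption-laden estimate).
p_trade_and_hide = p_trade * p_hide_given_trade
p_trade_hide_and_deny = p_trade_and_hide * p_deny_given_hide

print(f"trade and hide the tip:         {p_trade_and_hide:.1%}")
print(f"trade, hide, and deny on query: {p_trade_hide_and_deny:.1%}")
```

Under these assumptions, a bit more than half of all sampled completions would involve an insider trade followed by a report that concealed it.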
Yes, but: This work examined only one scenario, with its specific information and incentives. It remains to be shown how often large language models lie in other situations.
Behind the news: Large language models have proven to be untrustworthy in various ways beyond their occasional tendency to invent false information. They have shown deceptive behavior. They also tend to agree with a user’s viewpoint, even when it’s biased or inaccurate. Furthermore, they are known to repeat common misconceptions.
Why it matters: GPT-4 was fine-tuned to be helpful, harmless, and honest via reinforcement learning from human feedback (RLHF). However, this fine-tuning apparently didn’t make the model immune to pressure to cut corners in ways that people might find unethical or that might break the law. We will need a different approach if we want to stop models from lying under all circumstances.
We’re thinking: Large language models are trained to predict words written by humans. So perhaps it shouldn’t be surprising that they predict words that respond to social pressures, as some humans would. In a separate, informal experiment, GPT-4 generated longer, richer responses to prompts that included a promise of generous financial compensation.