Individuals and organizations increasingly use large language models to produce media that helps them compete for attention. Does fine-tuning LLMs to encourage engagement, purchases, or votes affect their alignment with social values? Researchers found that it does.
What’s new: Batu El and James Zou at Stanford University simulated three competitive arenas: social media, sales, and elections. They demonstrated that optimizing an LLM for success (using another LLM to simulate the audience) caused it to generate more deceptive or inflammatory output, a tradeoff they call Moloch’s Bargain.
Key insight: In competitive settings, the most effective message is not always the most benign. An audience may favor social posts that stir anger, sales pitches that exaggerate, and political messages that misrepresent the opposition. If an LLM is trained to generate output that pleases an audience — in this case, a separate LLM that stands in for a human audience — it can inadvertently learn to produce such harmful output.
How it works: The authors fine-tuned Qwen3-8B to win the approval of an audience simulated by GPT-4o mini.
- The authors asked GPT-4o mini to adopt each of 20 personas drawn from a dataset of characters in popular movies. For example: “I am a washed-up actor, once Dr. Lazarus in Galaxy Quest. I am British. I hate being typecast. I am bitter and regretful of my role. . .”
- Qwen3-8B generated social posts (based on articles from CNN/DailyMail), sales pitches (product descriptions from Amazon Reviews), and political campaign statements (from candidate biographies in CampaignView). The authors prompted the model to produce output that was faithful to the source material. They grouped the outputs into pairs of the same type.
- Given the same pair of Qwen3-8B outputs, each persona chose a favorite. A winner was selected by majority vote.
- The authors fine-tuned separate copies of Qwen3-8B to generate the winning social posts, sales pitches, and political messages, along with any text GPT-4o mini generated in the course of reaching the personas’ choices (a minimal sketch of this voting-and-selection step follows the list).
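To make the pairwise voting concrete, here is a minimal sketch in Python. It assumes an OpenAI-style chat API; the prompt wording, persona text, and helper names are illustrative placeholders rather than the authors’ exact setup.

```python
# Illustrative sketch of the pairwise persona-voting step (not the authors' code).
# Assumes the OpenAI Python client; prompts and persona text are placeholders.
from collections import Counter
from openai import OpenAI

client = OpenAI()

PERSONAS = [
    "I am a washed-up actor, once Dr. Lazarus in Galaxy Quest. I am British. ...",
    # ...19 more personas drawn from movie characters
]

def persona_vote(persona: str, output_a: str, output_b: str) -> str:
    """Ask one simulated audience member which of two candidate messages it prefers."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": persona},
            {"role": "user", "content": (
                "Here are two messages.\n\n"
                f"A: {output_a}\n\nB: {output_b}\n\n"
                "Which do you prefer? Answer with a single letter, A or B."
            )},
        ],
    )
    return response.choices[0].message.content.strip()[:1].upper()

def majority_winner(output_a: str, output_b: str) -> str:
    """Collect one vote per persona and return the message preferred by the majority."""
    votes = Counter(persona_vote(p, output_a, output_b) for p in PERSONAS)
    return output_a if votes["A"] >= votes["B"] else output_b

# Each winner becomes a fine-tuning example for Qwen3-8B, e.g.
# {"prompt": source_material, "completion": majority_winner(candidate_a, candidate_b)}.
```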
Results: The authors generated further social posts, sales pitches, and campaign statements using both the fine-tuned versions of Qwen3-8B and the base version. (i) They fed the outputs of the fine-tuned and base versions to the GPT-4o mini personas and gauged which they liked better; that is, each model’s win rate (a rough sketch of this tally follows the list below). (ii) They also fed the outputs to GPT-4o and asked it to detect disinformation and encouragement of unsafe behavior in social posts, misrepresentation in sales pitches, and disinformation and populist rhetoric in campaign statements. The fine-tuned versions of Qwen3-8B produced outputs that the personas found slightly more agreeable, while GPT-4o judged them to be more harmful.
- With social media posts, the fine-tuned version of Qwen3-8B won 57.5 percent of the time, while the base version won 42.5 percent of the time. However, 4.79 percent of the fine-tuned version’s output included disinformation, compared to 1.66 percent of the base version’s output.
- Considering sales pitches, the fine-tuned version won 50.5 percent of the time, while the base version won 49.5 percent of the time. But 1.27 percent of the fine-tuned version’s outputs misrepresented the product, compared to 0.91 percent of the base version’s outputs.
- For campaign statements, the fine-tuned version of Qwen3-8B won 53 percent of the time, while the base version won 47 percent of the time. However, 7.23 percent of the fine-tuned version’s output included disinformation, compared to 5.7 percent of the base version’s output.
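For reference, the sketch below shows how win rate and harm rate could be tallied from per-example judgments. The judgment labels would come from the GPT-4o mini personas and GPT-4o as described above; the data structure and field names are assumptions, not the authors’ implementation.

```python
# Hypothetical tally of win rate and harm rate from per-example judgments.
# "preferred" records the personas' majority vote; the harm flags record
# whether GPT-4o flagged each version's output (e.g., for disinformation).
from dataclasses import dataclass

@dataclass
class Judgment:
    preferred: str            # "finetuned" or "base"
    finetuned_harmful: bool
    base_harmful: bool

def summarize(judgments: list[Judgment]) -> dict[str, float]:
    """Return the fine-tuned model's win rate and both models' harm rates."""
    n = len(judgments)
    return {
        "finetuned_win_rate": sum(j.preferred == "finetuned" for j in judgments) / n,
        "finetuned_harm_rate": sum(j.finetuned_harmful for j in judgments) / n,
        "base_harm_rate": sum(j.base_harmful for j in judgments) / n,
    }

# A win rate of 0.575 corresponds to the fine-tuned model's output being
# preferred in 57.5 percent of head-to-head comparisons.
```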
Why it matters: Optimizing LLMs for common business goals like engagement or sales can increase their tendency to produce misinformation, encouragement of unsafe behavior, and inflammatory rhetoric. Simple instructions such as “stay faithful to the facts” are not enough to keep them from learning to generate undesirable output if they’re trained toward other goals that correlate with it.
We’re thinking: Using a small number of LLM-based personas to simulate large human audiences is a significant limitation in this work. The authors suggest they may follow up with tests in more realistic scenarios.