While large language models can behave in human-like ways, the similarities are superficial. A simple strategy game revealed clear differences in their strategic approaches.
What’s new: Caroline Wang and colleagues at the University of Texas at Austin and Google interpreted patterns of decision-making by humans and LLMs as they played the classic game of rock-paper-scissors. They found that LLMs sometimes model their opponents with greater sophistication than people do.
Key insight: Given recorded gameplay, an LLM can iteratively improve code that predicts a player’s next move. If the code predicts the player’s actions with high accuracy, we can assume that its decision-making algorithms are functionally similar to those the player used. Computer code is interpretable, making it possible to discern such algorithms and compare those used by humans and LLMs.
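The paper’s scoring metric is mentioned only by name, so the sketch below assumes a natural reading of evaluation likelihood: the geometric mean of the probabilities a predictive program assigns to the moves a player actually made, so 1/3 corresponds to uniform guessing and 1.0 to a perfect predictor. The function names and the baseline are illustrative, not the authors’ code.

```python
import math

MOVES = ("rock", "paper", "scissors")

def evaluation_likelihood(predict, moves):
    """Geometric-mean probability that `predict` assigns to the moves
    actually played (assumed reading of the metric, not confirmed)."""
    log_prob = 0.0
    for t, move in enumerate(moves):
        dist = predict(moves[:t])            # P(next move | moves so far)
        log_prob += math.log(max(dist[move], 1e-12))
    return math.exp(log_prob / len(moves))

# Trivial baseline: always guess uniformly, scoring exactly 1/3.
uniform = lambda past: {m: 1 / 3 for m in MOVES}
print(round(evaluation_likelihood(uniform, ["rock", "paper", "rock"]), 3))  # 0.333
```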
How it works: In games of rock-paper-scissors, the authors pitted individual LLMs (Gemini 2.5 Pro, Gemini 2.5 Flash, GPT-5.1, and GPT-OSS 120B) against each of 15 preprogrammed bots of varying complexity. They recorded each player’s moves in 20 games of 300 sequential rounds each. Previous work provided records of similar games between humans and the same bots. The authors tracked the round-by-round choices made by each player — AI and human — and whether they won, lost, or tied. Then they used AlphaEvolve, an agentic method that iteratively optimizes code through an evolutionary process, to improve Python programs that predicted the next move of each LLM individually and of humans as a group.
- AlphaEvolve started from a simple template program, written by the authors, that processed the game data. In each of an undisclosed number of evolutionary steps, Gemini 2.5 Flash proposed code modifications to improve a score that balanced simplicity (measured by Halstead effort) against evaluation likelihood (how well the program predicted a player’s choices).
- For each player, the authors selected the simplest program whose predictive accuracy fell within a small margin of the best. Each program achieved its best evaluation likelihood (higher is better) on the player it had evolved to predict. That is, it represented its corresponding player’s behavior better than it represented any other player’s. (A sketch of this selection step follows the list.)
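The selection rule lends itself to a short sketch. The code below uses the standard Halstead-effort formula computed from Python tokens and a hypothetical margin of 0.01; the paper doesn’t report the margin it used, and `halstead_effort` and `select_program` are illustrative names, not the authors’ implementation.

```python
import io
import keyword
import math
import tokenize

def halstead_effort(source: str) -> float:
    """Standard Halstead effort (difficulty x volume) from Python tokens.
    Operators are punctuation tokens and keywords; operands are names,
    numbers, and strings. A rough but conventional token classification."""
    operators, operands = [], []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.OP or keyword.iskeyword(tok.string):
            operators.append(tok.string)
        elif tok.type in (tokenize.NAME, tokenize.NUMBER, tokenize.STRING):
            operands.append(tok.string)
    n1, n2 = max(len(set(operators)), 1), max(len(set(operands)), 1)
    N1, N2 = len(operators), len(operands)
    volume = (N1 + N2) * math.log2(n1 + n2)
    difficulty = (n1 / 2) * (N2 / n2)
    return difficulty * volume

def select_program(candidates, margin=0.01):
    """Return the simplest candidate within `margin` of the best likelihood.
    `candidates` is a list of (source_code, evaluation_likelihood) pairs;
    the margin value is an assumption, not reported in the paper."""
    best = max(likelihood for _, likelihood in candidates)
    near_best = [c for c in candidates if c[1] >= best - margin]
    return min(near_best, key=lambda c: halstead_effort(c[0]))
```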
Results: Using held-out game data that AlphaEvolve didn’t process, the authors compared how well each program predicted every player’s moves. Then they examined the programs to determine what strategy each player used.
- The programs that represented Gemini 2.5 Pro, Gemini 2.5 Flash, and GPT-5.1 performed nearly equally well when predicting each other’s moves against the bots, which suggests that the trio used similar strategies. For example, when predicting the moves of Gemini 2.5 Pro, the programs evolved for Gemini 2.5 Pro, Gemini 2.5 Flash, and GPT-5.1 achieved evaluation likelihoods of 0.507, 0.507, and 0.506 respectively. The programs that represented humans and GPT-OSS 120B predicted the trio’s moves less successfully, achieving 0.476 and 0.403 respectively, which indicates that humans and GPT-OSS 120B likely used different strategies.
- Interpreting the programs suggested that Gemini 2.5 Pro, Gemini 2.5 Flash, and GPT-5.1 tracked sequential patterns more effectively than humans or GPT-OSS 120B did. The code that represented those three models tallied the frequency of each possible move conditioned on the player’s previous one or two moves. That is, it tracked how often, over three rounds, the player played rock → scissors → rock, rock → scissors → paper, and so on. In contrast, the code that represented humans and GPT-OSS 120B tracked the frequency of the opponent’s latest move only. (A sketch of both strategies follows the list.)
- The code that represented Gemini 2.5 Pro, Gemini 2.5 Flash, GPT-5.1, and human players computed a preliminary value for each candidate next move based on (i) the candidate move itself, (ii) the bot’s previous move, and (iii) the player’s previous move. The code that represented GPT-OSS 120B computed the value from the candidate move alone.
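The contrast between the two strategy families is easy to see in code. The sketch below paraphrases them as frequency-table predictors; it is not the evolved programs themselves, and the function names are hypothetical. The first conditions on the player’s previous two moves, as the programs for the Gemini models and GPT-5.1 did; the second conditions only on the opponent’s latest move, as the programs for humans and GPT-OSS 120B did.

```python
from collections import Counter, defaultdict

MOVES = ("rock", "paper", "scissors")

def order2_predictor(player_moves):
    """Sketch of the trio's strategy: count how often each move followed
    the player's previous two moves, then predict from those counts."""
    counts = defaultdict(Counter)
    for a, b, c in zip(player_moves, player_moves[1:], player_moves[2:]):
        counts[(a, b)][c] += 1

    def predict(last_two):
        tally = counts[tuple(last_two)]
        total = sum(tally.values())
        if total == 0:  # unseen context: fall back to a uniform guess
            return {m: 1 / 3 for m in MOVES}
        return {m: tally[m] / total for m in MOVES}

    return predict

def order1_predictor(opponent_moves, player_moves):
    """Sketch of the human/GPT-OSS 120B strategy: condition only on the
    opponent's latest move."""
    counts = defaultdict(Counter)
    for opp, nxt in zip(opponent_moves, player_moves[1:]):
        counts[opp][nxt] += 1

    def predict(opponent_last):
        tally = counts[opponent_last]
        total = sum(tally.values())
        return {m: (tally[m] / total if total else 1 / 3) for m in MOVES}

    return predict
```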
Why it matters: While researchers have found ways to understand some aspects of neural network behavior, large language models remain black boxes in many ways. Synthesizing code directly from LLM behavior offers a powerful tool to interpret their decision-making.
We’re thinking: It’s tempting to assume that LLMs learn to mimic human behavior as represented by their training data. Finding that they can encode a gaming strategy more systematically than the average human demonstrates a different sort of learning.