Small Models Solve Hard Puzzles
Tiny Recursive Model beats larger competitors at games like Sudoku and Maze

[Figure: Flowchart of the Tiny Recursive Model process, with stages for input, prediction, and latent refinement.]

Large language models often fail at puzzles like Sudoku, in which a solution comprises many interdependent elements and a single mistake invalidates the whole. Researchers showed that a tiny network that repeatedly refines its own output can solve this sort of puzzle well.

What’s new: Alexia Jolicoeur-Martineau at Samsung developed Tiny Recursive Model (TRM). The approach outperforms large, pretrained LLMs, including DeepSeek-R1 and Gemini 2.5 Pro, on grid-based puzzles that require inferring an abstract rule from limited information, specifically Sudoku-Extreme, Maze-Hard, and the current ARC-AGI benchmarks.

Key insight: Training a neural network to refine a solution iteratively can take place in 3 steps: (i) Give it the puzzle and a random initial solution and have it compute an improved solution, (ii) feed the output back in and compute a new solution, and so on, and (iii) backpropagate through this recursive process so the network learns to produce a more accurate solution through iteration. However, this approach has a key flaw: The network doesn’t keep track of the changes it has made, so during inference, from iteration to iteration, it may undo changes that improved the solution. To counteract this problem, the network can produce a separate context embedding that also feeds back with each iteration. This tactic enables it to learn to store any information that helps to improve performance, such as the changes it has made, without needing an explicit loss function designed for that purpose.
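
In code, the idea might look like the following minimal sketch; the sizes, names, and toy loss here are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Toy illustration of iterative refinement with a carried context embedding.
# One small network proposes updates to both the candidate solution y and
# the context z; the context lets it remember what it has already fixed.
net = nn.Sequential(nn.Linear(3 * 64, 128), nn.ReLU(), nn.Linear(128, 2 * 64))

def refine(x, y, z, n_iters=6):
    for _ in range(n_iters):
        dy, dz = net(torch.cat([x, y, z], dim=-1)).chunk(2, dim=-1)
        y = y + dy  # refine the candidate solution
        z = z + dz  # carry forward context, so earlier fixes aren't undone
    return y, z

x = torch.randn(8, 64)   # puzzle embedding
y = torch.randn(8, 64)   # random initial solution
z = torch.zeros(8, 64)   # empty context at the start
y_hat, _ = refine(x, y, z)
# Backpropagating through the whole recursion teaches the network to
# improve its answer across iterations rather than in a single pass.
nn.functional.mse_loss(y_hat, torch.randn(8, 64)).backward()
```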

How it works: A TRM is a 2-layer network whose architecture depends on the type of puzzle to be solved. The authors used a 5 million-parameter vanilla neural network to learn Sudoku-Extreme, whose solutions are 9x9 matrices, and 7 million-parameter transformers to learn Maze-Hard, ARC-AGI-1, and ARC-AGI-2, which involve 30x30 matrices. Solving these puzzles requires logic, pathfinding, and visual reasoning (at 2 levels of difficulty), respectively.

  • During training, given a puzzle (represented as tokens), solution tokens (random at first), and a context embedding (random at first), the network iterated for up to 16 cycles.
  • Within each cycle, it recursively updated the context embedding 18 times. Each update consisted of a forward pass through the network.
  • Each cycle included one more forward pass to produce an improved solution. The model learned to minimize the error between the improved solution and ground truth, and to classify correct solutions. If it recognized a correct solution, it stopped the process (see the sketch after this list).
  • During inference, given a puzzle, the network went through the same steps to produce a solution.
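
Putting the steps above together, the training loop might look like the sketch below. The loop counts (up to 16 cycles, 18 context updates per cycle) come from the description above; the modules, heads, and the "close enough counts as correct" halting signal are hypothetical stand-ins, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 64
tiny_net = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(), nn.Linear(dim, dim))
solution_head = nn.Linear(dim, dim)  # maps the context to an improved solution
halt_head = nn.Linear(dim, 1)        # classifies "is the solution correct?"

def trm_forward(x, y, z, target=None, max_cycles=16, latent_updates=18):
    loss = torch.zeros(())
    for _ in range(max_cycles):
        # Recursively update the context embedding; each update is one
        # forward pass through the network.
        for _ in range(latent_updates):
            z = tiny_net(torch.cat([x, y, z], dim=-1))
        # One more forward pass per cycle produces an improved solution.
        y = solution_head(z)
        halt_logit = halt_head(z).squeeze(-1)
        if target is not None:
            # Minimize error against ground truth, and learn to classify
            # (nearly) correct solutions so the model knows when to stop.
            err = F.mse_loss(y, target, reduction="none").mean(-1)
            loss = loss + err.mean()
            correct = (err.detach() < 0.01).float()
            loss = loss + F.binary_cross_entropy_with_logits(halt_logit, correct)
        # Stop early once the model judges its solution correct; inference
        # runs the same loop without the loss terms.
        if torch.sigmoid(halt_logit).mean() > 0.5:
            break
    return y, loss

x, y0, z0 = torch.randn(4, dim), torch.randn(4, dim), torch.randn(4, dim)
_, loss = trm_forward(x, y0, z0, target=torch.randn(4, dim))
loss.backward()
```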

Results: TRM outperformed the earlier Hierarchical Reasoning Model (HRM) (27 million parameters) as well as most of the pretrained LLMs tested.

  • On Sudoku-Extreme and Maze-Hard, TRM (87.4 and 85.3 percent accuracy, respectively) exceeded HRM (55 and 74.5 percent accuracy). Anthropic Claude Sonnet 3.7, DeepSeek-R1, and OpenAI o3-mini set to high reasoning effort achieved 0 percent accuracy.
  • On ARC-AGI-1, TRM (44.6 percent accuracy) came out behind xAI Grok 4 with thinking mode enabled (66.7 percent accuracy) but ahead of HRM (40.3 percent accuracy), Gemini 2.5 Pro (37 percent accuracy), and Claude Sonnet 3.7 with thinking mode enabled (28.6 percent accuracy).
  • Similarly, on the more-challenging ARC-AGI-2 benchmark, TRM (7.8 percent accuracy) underperformed Grok 4 with thinking mode enabled (16.0 percent accuracy) but outperformed HRM (5.0 percent accuracy), Gemini 2.5 Pro (4.9 percent accuracy), and Claude Sonnet 3.7 with thinking mode enabled (0.7 percent accuracy).

Why it matters: A tiny model excels at solving puzzles that require multifaceted solutions to be perfectly correct. Training a simple — but specialized — architecture can be more effective and efficient than raw scale.

We’re thinking: LLMs reason by generating a chain of thought in tokens, one model execution at a time, before the final output. TRM, by contrast, reasons in latent space, recursively updating its context embedding, one model execution at a time, before the final output.
