Z.ai updated its flagship open-weights large language model to work autonomously on single tasks for up to eight hours.
What’s new: GLM-5.1 is designed for coding and agentic tasks. Z.ai says the model can try an approach, evaluate the result, and revise its strategy if the result is inadequate, repeating this loop hundreds of times rather than giving up early.
- Input/output: Text in (up to 200,000 tokens), text out (up to 128,000 tokens)
- Architecture: Mixture-of-experts transformer, 754 billion parameters total, 40 billion parameters active per token
- Features: Reasoning, function calling, structured output
- Performance: Highest-scoring open-weights model on Artificial Analysis’ Intelligence Index, third on Arena’s Code leaderboard, led SWE-Bench Pro (in Z.ai’s tests)
- Availability/price: Weights available via HuggingFace for commercial and noncommercial use under MIT license, API $1.40/$0.26/$4.40 per million input/cached/output tokens, coding plans $48.60 to $432 per quarter
- Undisclosed: Specific architecture, training data, and training methods
How it works: Z.ai has not published a technical report specific to GLM-5.1, which appears to follow GLM-5’s basic architecture, attention mechanism, pretraining, and input/output size limits. The key improvement is sustained productivity in long-running tasks.
- Where GLM-5 and many other models produce final output within a certain token budget or until they determine that further reasoning won’t change the results, GLM-5.1 cycles through planning, execution, evaluation of intermediate results, and evaluation of its approach until it judges the task to be complete. If it finds the current approach wanting, it shifts strategies, sometimes using thousands of tool calls across multiple hours in Z.ai’s tests.
- The company said it optimized GLM-5.1 for agentic coding but did not specify how.
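Z.ai has not described its control loop in detail, so the following is only an illustrative sketch of the pattern the bullets above describe: cycle through planning, execution, and evaluation, and abandon a failing approach for another rather than giving up. The `Strategy` class, the `patience` threshold, and the toy success condition are all assumptions for demonstration, not Z.ai’s implementation.

```python
from dataclasses import dataclass

@dataclass
class Strategy:
    """Toy stand-in for an approach to a task: 'succeeds' only
    after a fixed number of attempts. Purely illustrative."""
    name: str
    steps_to_succeed: int
    attempts: int = 0

    def plan_and_execute(self, task):
        # In a real agent this would plan actions and call tools;
        # here it just counts attempts and reports success/failure.
        self.attempts += 1
        return self.attempts >= self.steps_to_succeed

def agent_loop(task, strategies, max_steps=500, patience=3):
    """Plan/execute/evaluate loop that switches strategy after
    `patience` unproductive steps instead of stopping early."""
    idx, failures, trace = 0, 0, []
    for _ in range(max_steps):
        strategy = strategies[idx]
        done = strategy.plan_and_execute(task)
        trace.append(strategy.name)
        if done:
            return strategy.name, trace       # task judged complete
        failures += 1
        if failures >= patience and idx + 1 < len(strategies):
            idx += 1                          # abandon failing approach
            failures = 0
    return None, trace                        # step budget exhausted

# A brute-force approach that would take 99 steps is dropped after
# 3 fruitless attempts in favor of one that succeeds quickly.
winner, trace = agent_loop(
    "fix bug",
    [Strategy("brute-force", 99), Strategy("refactor", 2)],
)
print(winner)  # refactor
```

The key design point the sketch captures is that the loop’s stopping condition is task completion (or an explicit step budget), not a fixed token budget, which is what allows runs of thousands of tool calls over hours.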
Performance: GLM-5.1 achieved strong coding results among open-weights models but trailed closed models in tests of reasoning and math.
- On Artificial Analysis’ Intelligence Index, a composite of 10 tests of economically useful tasks, GLM-5.1 set to reasoning mode (51) scored highest among open-weights models but behind the proprietary models Gemini 3.1 Pro Preview set to reasoning and GPT-5.4 set to xhigh reasoning (tied at 57) as well as Claude Opus 4.6 set to max reasoning (53).
- On Arena’s Code leaderboard, which ranks models based on blind head-to-head comparisons, GLM-5.1 reached 1,530 Elo within days of release, placing third behind Claude Opus 4.6 (1,542 Elo) and Claude Opus 4.6 set to reasoning (1,548 Elo).
- In Z.ai’s own tests, GLM-5.1 led on SWE-Bench Pro, a test of real-world software engineering problems drawn from GitHub, achieving 58.4 percent compared to GPT-5.4 (57.7 percent), Claude Opus 4.6 (57.3 percent), and Gemini 3.1 Pro (54.2 percent).
- On CyberGym, which tests cybersecurity reasoning, GLM-5.1 (68.7) scored highest among models tested by Z.ai, including Claude Opus 4.6 (66.6) and GPT-5.4 (66.3), though Anthropic later reported a higher score for Claude Mythos (83.1). Gemini 3.1 Pro and GPT-5.4 refused to execute certain tasks for safety reasons, which likely lowered their scores.
- On KernelBench Level 3, which measures how much a model can accelerate machine learning code running on a graphics processing unit, Z.ai measured GLM-5.1 (3.6x) behind Claude Opus 4.6 (4.2x).
- GLM-5.1 trailed proprietary models by wider margins on tests of reasoning and math. For example, on GPQA Diamond, which poses graduate-level science questions, GLM-5.1 (86.2 percent accuracy) underperformed Gemini 3.1 Pro (94.3 percent accuracy). On AIME 2026, competition math problems, GLM-5.1 (95.3 percent) fell behind GPT-5.4 (98.7 percent).
Price increase: Z.ai priced GLM-5.1 significantly higher than its predecessor. Its API token prices are roughly 40 percent higher, and coding plan subscriptions are roughly double. Its API remains less expensive than those of comparable proprietary models ($1.40 per million input tokens versus $5 per million for Claude Opus 4.6), but the gap is narrowing.
Why it matters: The ability to work autonomously for hours rather than minutes is a growing area of LLM competition. The lengths of tasks completed autonomously by AI agents have doubled roughly every seven months, according to METR, an independent testing organization, and Anysphere’s Cursor integrated development environment ran a swarm of agents for a week. However, benchmarks that are designed to test sustained performance, such as SWE-EVO, show that even top models successfully complete only around 25 percent of long-running coding tasks.
We’re thinking: If GLM-5.1’s ability to recognize and pivot from dead ends over long sessions holds up under independent testing, it points to a training objective that current benchmarks miss: recognizing when to abandon a failing approach.