Top Agentic Performance, Low Cost GLM-5.2, designed for coding and long-running agentic jobs, now the top open model

Published

Jun 26, 2026

Reading time

4 min read

Z.ai released an open-weights model that rivals proprietary leaders for autonomous agentic tasks.

What’s new: GLM-5.2, the latest in a series of large language models that are optimized for coding, including an input context far larger than its predecessor’s.

Input/output: Text in (up to 1 million tokens), text out (up to 128,000 tokens, 103 tokens per second)
Architecture: Mixture-of-experts transformer, 753 billion parameters total, 40 billion parameters active per token
Features: Two reasoning levels (high, max), function calling, structured output, context caching (reuses inputs so repeated parts of prompts are not recomputed)
Performance: First among open models (third among all currently available models) on Artificial Analysis’s Intelligence Index v4.1, leads all models on PostTrainBench (a test of long-running agentic coding), second on Arena.ai Code Arena WebDev leaderboard
Availability/price: Weights available for commercial and noncommercial uses under MIT license via Hugging Face, API $1.40/$0.26/$4.40 per million input/cached/output tokens, GLM Coding Plans $12.60 to $112 per month
Undisclosed: Training data and methods specific to GLM-5.2

How it works: GLM-5.2 builds on GLM-5. The team modified the earlier model’s implementation of DeepSeek sparse attention to reduce the processing required, making it practical to expand the input context to 1 million tokens from GLM-5’s 200,000 tokens.

Z.ai trained GLM-5.2 specifically on long-running agentic tasks such as deep research, code deployment and performance optimization, and complex debugging.
Earlier GLM models learned via Group Relative Policy Optimization, a reinforcement learning method that dispenses greater rewards for attempts to complete a task that outperform the average of several attempts. However, GLM-5.2’s tasks ran long enough that the team divided individual attempts into pieces, so they couldn’t average them easily. Instead, the team switched to Proximal Policy Optimization, which judges each attempt individually via a critic model.
During reinforcement learning, a coding task usually is graded based on a pass-fail check. But an agentic process, instead of solving a problem, can pass such tests by using tools to, for instance, fetch reference solutions from GitHub. GLM-5.2 resorted to such reward hacking more frequently than GLM-5.1 did. To address this, the team added a rule-based filter that flagged suspect tool calls, used a separate language model to judge whether each flagged call shortcutted the task, and blocked such calls by feeding GLM-5.2 dummy data so training could continue.
To reduce the computation required to process attention over longer contexts, the model uses a sparse attention indexer — a component that, for each new token, selects which earlier tokens to attend to — once every four layers instead of every layer, reusing its output for the other three layers. The company says this cuts per-token computation by 2.9 times within 1-million-token context. This approach modifies an earlier method called IndexCache.
To generate tokens faster, a small draft model proposes several tokens at once and the main model accepts or rejects them in a process called speculative decoding. GLM-5.2 accepts 5.47 tokens compared to GLM-5.1’s 4.56, a 20 percent gain.

Performance: GLM-5.2 posted the strongest performance of any open-weights model in Artificial Analysis’s tests. It ran close to the leading proprietary models from Anthropic and OpenAI on several agentic benchmarks, edging ahead of some and trailing others narrowly.

On Artificial Analysis’s Intelligence Index, a composite of 9 evaluations of economically useful tasks, GLM-5.2 set to max reasoning (51) ranked first among open-weights models, behind Claude Opus 4.8 set to max reasoning (56) and GPT-5.5 set to xhigh reasoning (55) but well ahead of DeepSeek V4 Pro set to max reasoning and MiniMax-M3 set to unspecified reasoning (tied at 44).
On the Arena.ai Code Arena’s WebDev leaderboard, which ranks models according to human votes on web development tasks, GLM-5.2 set to max reasoning (1,593 Elo) ranked second behind Claude Fable 5 (1,654 Elo) and ahead of all variants of Claude Opus 4 and GPT-5.5.
On PostTrainBench, a test that asks an agent to fine-tune four large language models and evaluates their performance on seven benchmarks, GLM-5.2 set to max reasoning (34.3 percent) narrowly topped Claude Opus 4.8 set to max reasoning (34.1 percent) and Claude Fable 5 set to max reasoning (30.7 percent).
On AA-Briefcase, an Artificial Analysis benchmark introduced in June 2026 that scores agents’ ability to generate business documents such as spreadsheets, presentations, and memos, GLM-5.2 set to max reasoning (1,266 Elo) led all open-weights models and placed third overall behind Claude Fable 5 (1,587 Elo) and Claude Opus 4.8 set to max reasoning (1,356 Elo).

Behind the news: High-performance, open-weights models become more attractive as the U.S. government and U.S. companies tighten the screws on AI technology developed within the country. Z.ai released GLM-5.2 only one day after the U.S. government restricted access to Anthropic’s Claude Fable 5 and Claude Mythos 5 to citizens, and Anthropic suspended access to Claude Fable 5.

Why it matters: Beyond GLM-5.2’s open license, the low cost of Z.ai’s API gives developers an additional incentive to adopt it. Developers who find Claude Opus 4.8 or GPT-5.5 too pricey can obtain similar agentic and coding capabilities for as little as a quarter of the cost, according to Artificial Analysis’ assessment of cost per intelligence.

We’re thinking: Open weights continue to close in on top closed models. GLM-5.2’s outstanding performance on web-dev and post-training tasks suggests that advanced agentic capabilities are available to anyone with sufficiently advanced hardware, free of charge.

Subscribe to The Batch