Qwen3.7-Max Adds Speed and Power
Alibaba's latest proprietary model challenges U.S. rivals

Alibaba updated its flagship large language model for long-running agentic work, pushing it into the top rank among LLMs built in China.

What’s new: Alibaba positions Qwen3.7-Max as its preferred model for text-only work like coding and scientific discovery. Like other top-tier Qwen models since late 2025, its weights are not open. (Alibaba simultaneously released the multimodal Qwen3.7-Plus-Preview.)

  • Input/output: Text in (up to 1 million tokens), text out (up to 64,000 tokens, 208.3 tokens per second)
  • Features: Reasoning, tool use, prompt caching, native compatibility with OpenAI’s and Anthropic’s API specifications, ability to retain reasoning text across turns
  • Performance: Ranks seventh on Artificial Analysis Intelligence Index
  • Availability: Free via Qwen Chat (account required); API via Alibaba Cloud Model Studio at $2.50/$0.25/$7.50 per million input/cached/output tokens
  • Undisclosed: Parameter count, architecture, training data and methods
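To make the API pricing above concrete, here is a minimal sketch that estimates the cost of a single request. The function name and the example token counts are hypothetical, and the sketch assumes that cached tokens are billed at the cached rate in place of the input rate. Alibaba's actual billing rules may differ.

```python
# Prices in U.S. dollars per million tokens, taken from the availability note above.
PRICE_INPUT = 2.50
PRICE_CACHED = 0.25
PRICE_OUTPUT = 7.50


def estimate_cost(uncached_input_tokens: int, cached_tokens: int, output_tokens: int) -> float:
    """Estimate the dollar cost of one request.

    Assumption (not confirmed by the source): cached tokens are charged
    the cached rate instead of the regular input rate.
    """
    total = (
        uncached_input_tokens * PRICE_INPUT
        + cached_tokens * PRICE_CACHED
        + output_tokens * PRICE_OUTPUT
    )
    return total / 1_000_000


# Hypothetical request: 200,000 fresh input tokens, 800,000 cached tokens,
# and the maximum 64,000 output tokens.
cost = estimate_cost(200_000, 800_000, 64_000)
print(f"${cost:.2f}")
```

Under these assumptions, output tokens account for a large share of the bill even though they are a small fraction of the total token count.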

How it works: Alibaba described Qwen3.7-Max’s reinforcement-learning approach at a high level. The approach separates three components that Alibaba says are typically coupled in agent training: the task to be performed, an agentic harness that calls tools, and a verifier that decides whether the system succeeded. Alibaba trained the model on many combinations of task, harness, and verifier to prevent it from learning tricks specific to a single setup. 
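Alibaba has not published its training code, so the following is only an illustrative sketch of the idea of decoupling task, harness, and verifier. All names in it are hypothetical placeholders. It enumerates every combination of the three components and samples training episodes across them, so no single pairing dominates.

```python
import itertools
import random
from typing import List, NamedTuple


class Episode(NamedTuple):
    task: str      # what the agent is asked to do
    harness: str   # the tool-calling scaffold the agent runs in
    verifier: str  # how success is judged


# Hypothetical component pools; the real ones are undisclosed.
TASKS = ["fix_failing_test", "optimize_kernel", "summarize_paper"]
HARNESSES = ["shell_tools", "json_function_calls"]
VERIFIERS = ["unit_tests", "rubric_grader"]


def all_setups() -> List[Episode]:
    """Return every (task, harness, verifier) combination."""
    return [Episode(t, h, v) for t, h, v in itertools.product(TASKS, HARNESSES, VERIFIERS)]


def sample_batch(batch_size: int, seed: int = 0) -> List[Episode]:
    """Sample episodes uniformly across combinations, with replacement."""
    rng = random.Random(seed)
    setups = all_setups()
    return [rng.choice(setups) for _ in range(batch_size)]


for episode in sample_batch(4):
    print(episode)
```

In practice, not every verifier suits every task, so a real system would likely filter out invalid combinations. The sketch skips that step for brevity.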

Performance: Qwen3.7-Max ranks just behind leading U.S. models from OpenAI, Anthropic, and Google on the Artificial Analysis Intelligence Index. It keeps its hallucination rate low partly by declining to respond more often than its peers.

  • On the Artificial Analysis Intelligence Index, a composite of 10 tests of economically useful tasks, Qwen3.7-Max set to reasoning (56.6) ranks fifth or seventh depending on the reasoning levels of various models. It’s behind Gemini 3.1 Pro Preview set to an unspecified level of reasoning (57.2) and ahead of Google Gemini 3.5 Flash set to high reasoning (55.3). Running those evaluations consumed roughly 97 million output tokens, well above the average 35 million tokens.
  • On AA-Omniscience, an Artificial Analysis measure of factual knowledge that rewards correct output, penalizes incorrect output, and doesn’t count abstentions (which helps to distinguish between models that do and don’t acknowledge the limits of their knowledge), Qwen3.7-Max set to an unspecified level of reasoning ranked sixth (14), well behind Gemini 3.1 Pro Preview set to reasoning (33) and ahead of Claude Sonnet 4.6 set to max reasoning (12). Qwen3.7-Max’s 23 percent hallucination rate was the lowest among frontier models tested, but the model achieved that rate partly by declining to respond to more than half of the prompts.
  • In Artificial Analysis’ measure of output speed, Qwen3.7-Max set to an unspecified level of reasoning (208 tokens per second) tied for third place with Gemini 3.5 Flash — reasoning unspecified — just behind GPT-OSS 120B (313 tokens per second) and GPT-OSS 20B (238 tokens per second).
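To illustrate how a score that rewards correct answers, penalizes incorrect ones, and ignores abstentions can favor a cautious model, here is a minimal sketch. The formula is an assumption made for illustration (correct minus incorrect, divided by all prompts, scaled to 100) and may not match Artificial Analysis' exact definition. The example counts are invented.

```python
def omniscience_style_score(correct: int, incorrect: int, abstained: int) -> float:
    """Score in [-100, 100]: +1 per correct answer, -1 per incorrect answer,
    0 per abstention, averaged over all prompts.

    Assumed formula for illustration only, not the official metric.
    """
    total = correct + incorrect + abstained
    if total == 0:
        raise ValueError("at least one prompt is required")
    return 100.0 * (correct - incorrect) / total


# Hypothetical model that answers every prompt.
eager = omniscience_style_score(correct=60, incorrect=40, abstained=0)

# Hypothetical model that declines half of the prompts.
cautious = omniscience_style_score(correct=35, incorrect=15, abstained=50)

print(eager, cautious)
```

Under this assumed formula, the cautious model matches the eager one despite giving far fewer correct answers, because it also gives far fewer wrong ones.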

Yes, but: Although Alibaba touts Qwen3.7-Max’s agentic capabilities, the claim is based on an internal test that is not yet validated by independent benchmarks. The model autonomously optimized an attention kernel on hardware it had not encountered during training. In 35 hours, it made 1,158 tool calls and ran 432 kernel evaluations (test runs of candidate code). The resulting code ran roughly 10 times faster than a standard reference implementation. Artificial Analysis has not yet tested Qwen3.7-Max on its benchmark of long-running agentic tasks.

Behind the news: Qwen3.7-Max continues Alibaba’s shift from open to closed models. In addition to Qwen3.7-Max, Qwen3.6-Max-Preview and Qwen3.6-Plus have closed weights, while the weights for the less capable Qwen3.6-27B and Qwen3.6-35B-A3B are freely available. At the same time, Alibaba started charging for access to Qwen Code, a command-line coding tool. These changes follow turnover in the Qwen team’s leadership and suggest that Alibaba aims to leverage its top-tier models to produce revenue rather than maximize its reach. 

Why it matters: Qwen3.7-Max is the smartest Chinese LLM, judging by the Artificial Analysis Intelligence Index, and it’s the third-fastest overall.

We’re thinking: We’re saddened by Alibaba’s turn toward closed weights, but we’re pleased that it’s keeping its lower tiers open. AI companies need to innovate in ways to turn open weights into revenue as well as in model architectures and training methods.
