Alibaba’s new model family may end DeepSeek-R1’s four-month reign as the top open-weights large language model.
What’s new: Alibaba released weights for eight large language models, all of which offer a reasoning mode that can be switched on or off. Two use a mixture-of-experts (MoE) architecture: Qwen3-235B-A22B (the name indicates 235 billion parameters, 22 billion of which are active at any given time) and Qwen3-30B-A3B. The other six are dense models that range from 32 billion down to 0.6 billion parameters, tiny by LLM standards, and they offer reasoning, too.
- Input/output: MoE models: Text in (up to 131,072 tokens), text out. Dense models: Text in (up to 32,768 tokens), text out.
- MoE architecture: Transformers. Qwen3-235B-A22B: 235 billion parameters, 22 billion active at any given time. Qwen3-30B-A3B: 30.5 billion parameters, 3.3 billion active at any given time.
- Dense architecture: Transformers with parameter counts of 32 billion, 14 billion, 8 billion, 4 billion, 1.7 billion, and 0.6 billion
- Training data: Pretrained on 36 trillion tokens, both scraped from the web and synthetically generated, including textbooks, PDF documents, question-answer pairs, math problems, and code
- Features: Selectable reasoning mode (see the sketch after this list), multilingual (119 languages and dialects)
- Undisclosed: Knowledge cutoff, fine-tuning data, output limits
- Availability: Free for commercial and noncommercial uses under the Apache 2.0 license via Hugging Face and ModelScope
- API price: Qwen3-235B-A22B: $0.22/$0.88 per million input/output tokens. Qwen3-30B-A3B: $0.15/$0.60 per million input/output tokens. Via Fireworks.ai
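Qwen3 exposes the reasoning toggle at the chat-template level rather than as separate models. Below is a minimal sketch of switching it per request with Hugging Face transformers; the repo name Qwen/Qwen3-30B-A3B and the enable_thinking flag follow Qwen3’s model cards, but verify current usage against the card before relying on them.

```python
# Minimal sketch: switching Qwen3's reasoning mode on or off per request.
# The repo name and the `enable_thinking` chat-template flag are taken from
# Qwen3's model cards; verify against the card before relying on them.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen3-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How many primes are less than 50?"}]

for thinking in (True, False):
    # enable_thinking=True makes the model reason inside <think>...</think>
    # before answering; False skips that, saving tokens, latency, and cost.
    prompt = tokenizer.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True,
        enable_thinking=thinking,
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=1024)
    new_tokens = output[0][inputs["input_ids"].shape[-1]:]
    print(f"--- enable_thinking={thinking} ---")
    print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```

With thinking enabled, the chain of thought arrives wrapped in <think>...</think> tags ahead of the final answer, which makes it easy to strip from user-facing output or log separately.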
How it works: The Qwen3 family implements chain-of-thought reasoning in both relatively large and quite small LLMs.
- The team pretrained Qwen3 models on roughly twice the data used to pretrain Qwen2.5. A substantial part of the additional data was devoted to training the models in several major languages plus region-specific dialects like Haitian, Luxembourgish, and Eastern Yiddish, and lesser-known Austronesian languages like Waray, Minangkabau, and Iloko.
- Pretraining took place over three stages that progressed to longer, more complex data.
- The authors fine-tuned the models on long chains of thought in domains that included coding, engineering, logical reasoning, mathematics, science, and technology.
- A reward model reinforced successful completions of these tasks (see the toy sketch below). The in-progress models then generated synthetic data to train the non-reasoning mode, and finally the developers used reinforcement learning to train the models to follow instructions, generate outputs in specific formats, and act as agents.
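Alibaba hasn’t published the details of this post-training recipe. As a generic illustration of the first step, a reward model filtering chain-of-thought completions into fine-tuning data (often called rejection sampling), here is a toy sketch; the sampler, scorer, and threshold are hypothetical placeholders, not Qwen3’s actual components.

```python
# Toy sketch of reward-filtered fine-tuning data, NOT Alibaba's actual recipe.
# sample_completions() and reward_score() are hypothetical placeholders for a
# policy model and a reward model; in practice both would be LLM calls.
import random

def sample_completions(prompt: str, n: int = 4) -> list[str]:
    """Placeholder: a policy model would draw n candidate chains of thought here."""
    return [f"<think>attempt {i}: {prompt}</think> answer {i}" for i in range(n)]

def reward_score(prompt: str, completion: str) -> float:
    """Placeholder: a reward model would rate correctness and quality here."""
    return random.random()

def build_sft_set(prompts: list[str], threshold: float = 0.7) -> list[dict]:
    """Keep the best-scoring completion per prompt, and only if it clears a bar."""
    kept = []
    for prompt in prompts:
        scored = [(reward_score(prompt, c), c) for c in sample_completions(prompt)]
        score, best = max(scored)
        if score >= threshold:
            kept.append({"prompt": prompt, "completion": best})
    return kept

print(build_sft_set(["Prove that 17 is prime.", "Sort [3, 1, 2]."]))
```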
Results: Qwen3-235B-A22B and Qwen3-30B-A3B performed as well as, or better than, leading open-weights models in Alibaba’s tests. Qwen3-4B, too, achieved results competitive with many models several times its size. Alibaba didn’t provide results for the other dense models.
- On coding challenges in LiveCodeBench and Codeforces, Qwen3-235B-A22B (70.7 percent and 2,056 Elo, respectively) outperformed OpenAI o1, DeepSeek-R1, and Google Gemini 2.5 Pro but fell behind OpenAI o4-mini set to high effort. It outperformed the same models on the Berkeley Function-Calling Leaderboard (BFCL). Among the models Alibaba compared, it finished behind only Gemini 2.5 Pro on tests of math skills (AIME 2024, AIME 2025) and on a mix of recently updated math, language, and problem-solving questions (LiveBench).
- Qwen3-30B-A3B outperformed Google Gemma-3-27B-IT and DeepSeek-V3 on all benchmarks highlighted by Alibaba, and it underperformed only OpenAI GPT-4o on BFCL. On GPQA Diamond, a test of graduate-level questions in a variety of domains, Qwen3-30B-A3B (65.8 percent) outperformed the next-best model, DeepSeek-V3.
- Qwen3-4B, with just 4 billion parameters, was competitive with DeepSeek-V3 (671 billion parameters) and Gemma-3-27B-IT (27 billion) across a wide range of benchmarks. For instance, on both Codeforces and LiveBench, Qwen3-4B (1,671 Elo and 63.6 percent, respectively) outperformed DeepSeek-V3 (1,134 Elo and 60.5 percent).
Why it matters: Qwen3 continues a string of high-performance, open-weights models released by developers in China. Alibaba says it designed the models to do the thinking in agentic systems. Reasoning that can be switched on and off can help control costs in agentic and other applications.
We’re thinking: Alibaba’s 235-billion parameter MoE model may perform better according to benchmarks, but Qwen3-30B-A3B does nearly as well and can run locally on a pro laptop without straining its memory. Add the easy ability to switch reasoning on or off, and Qwen3’s versatile, mid-sized MoE model may turn out to be the star of the show.
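A rough sanity check on the laptop claim: weight memory scales with parameter count times bytes per parameter. The sketch below is back-of-envelope arithmetic only; it ignores KV cache, activations, and runtime overhead, and the laptop-RAM framing in the comment is an assumption rather than a measured figure.

```python
# Back-of-envelope weight-memory estimate for Qwen3-30B-A3B (30.5B parameters).
# Ignores KV cache, activations, and runtime overhead, so real usage runs higher.
PARAMS = 30.5e9

for label, bytes_per_param in [("fp16/bf16", 2.0), ("int8", 1.0), ("4-bit", 0.5)]:
    gib = PARAMS * bytes_per_param / 2**30
    print(f"{label:>9}: ~{gib:.0f} GiB of weights")

# Prints roughly: fp16/bf16 ~57 GiB, int8 ~28 GiB, 4-bit ~14 GiB. The 4-bit
# figure is what puts the model within reach of a high-memory laptop. Only
# ~3.3B parameters are active per token, which helps speed, but all experts
# still have to fit in memory.
```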