Meta pivoted from its open-weights strategy to deliver a closed alternative.
What’s new: Meta introduced its first AI model in a year and the first product of its nine-month-old Superintelligence Labs. Muse Spark is a natively multimodal reasoning model with support for tool use and multi-agent orchestration. It leads in some health and multimodal benchmarks but falls short in coding and agentic work. Meta frames the release as validating an architectural redesign on which the company plans to build larger models.
- Input/output: Text, image, speech in (up to 262,000 tokens), text out
- Performance: Fourth place on the Artificial Analysis Intelligence Index
- Availability: Free via meta.ai and Meta AI app; coming to WhatsApp, Instagram, Facebook, Messenger, and Ray-Ban Meta AI glasses; API preview for selected partners
- Features: Three reasoning modes (instant, thinking, contemplating), shopping mode
- Undisclosed: Parameter count, architecture, training data and methods, output size limit
How it works: Meta disclosed limited technical details about Muse Spark but highlighted gains in training efficiency and multi-agent orchestration plus a domain-specific investment in health.
- The company reworked its pretraining approach, model architecture, optimization, and data curation. Meta says Muse Spark matches Llama 4 Maverick’s capabilities with over an order of magnitude less processing devoted to training.
- Post-training involved reinforcement learning in which the team penalized the model for using excessive reasoning tokens, a process the team calls thought compression. Under this penalty, the model first improved by reasoning longer, then learned to compress its reasoning, and then extended its reasoning for further improvement.
- Rather than processing a single chain of thought, contemplating mode launches multiple agents that, in parallel, propose solutions and refine them, then aggregates the results. Meta says this achieves better performance at comparable latency.
- To improve health reasoning, Meta enlisted more than 1,000 physicians to help curate training data aimed at producing more accurate and thorough health responses.
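Meta hasn’t disclosed how thought compression is implemented. As a rough illustration of the idea, a reinforcement-learning reward can combine task correctness with a penalty proportional to the number of reasoning tokens spent; the function name, penalty form, and coefficient below are illustrative assumptions, not Meta’s method.

```python
def shaped_reward(is_correct: bool, reasoning_tokens: int,
                  penalty_per_token: float = 1e-4) -> float:
    """Reward a correct answer, minus a cost for a long chain of thought.

    Illustrative sketch of a length-penalized RL reward: under such a
    penalty, a model is pushed to keep only the reasoning that pays for
    itself in accuracy.
    """
    task_reward = 1.0 if is_correct else 0.0
    return task_reward - penalty_per_token * reasoning_tokens

# A correct short answer earns more than an equally correct long one,
# and an incorrect answer scores lowest regardless of length.
concise = shaped_reward(True, 500)
verbose = shaped_reward(True, 5000)
wrong = shaped_reward(False, 500)
```

Under this kind of objective, the training dynamics Meta describes are plausible: the model first gains accuracy by reasoning longer (the correctness term dominates), then compresses as the penalty bites, then extends again only where extra tokens buy real accuracy.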
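Contemplating mode’s propose-refine-aggregate loop can be sketched as parallel agents whose outputs are combined by a simple majority vote. Everything here is an assumption for illustration: Meta has not disclosed its agents’ design or aggregation scheme.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

def contemplate(agents: List[Callable[[str], str]], prompt: str) -> str:
    """Run each agent on the prompt in parallel, then return the most
    common answer. Majority voting is a stand-in for whatever
    aggregation step Meta actually uses."""
    with ThreadPoolExecutor(max_workers=len(agents)) as pool:
        proposals = list(pool.map(lambda agent: agent(prompt), agents))
    answer, _count = Counter(proposals).most_common(1)[0]
    return answer

# Toy agents that each "propose" a fixed answer:
agents = [lambda p: "42", lambda p: "42", lambda p: "41"]
result = contemplate(agents, "What is 6 x 7?")  # majority answer wins
```

Because the agents run concurrently, wall-clock latency stays close to that of a single agent, which matches Meta’s claim of better performance at comparable latency.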
Results: Muse Spark’s benchmark performance is generally competitive and notably token-efficient. Meta acknowledged that it shows gaps in coding and agentic performance.
- On the Artificial Analysis Intelligence Index, a composite of 10 benchmarks of economically useful tasks, Muse Spark set to reasoning (52) places fourth overall behind the tied-for-first Gemini 3.1 Pro Preview set to high reasoning and GPT-5.4 set to xhigh reasoning (both 57) and third-place Claude Opus 4.6 set to max reasoning (53). Muse Spark used around 59 million tokens to complete the index, compared to roughly 158 million for Claude Opus 4.6 and 116 million for GPT-5.4.
- Muse Spark earns top marks in at least one multimodal benchmark. On CharXiv Reasoning (understanding charts and figures), Muse Spark (86.4 percent) outperformed GPT-5.4 (82.8 percent) and Gemini 3.1 Pro (80.2 percent), according to Meta. On MMMU Pro (solving multidisciplinary visual problems), Muse Spark (81 percent) placed second behind Gemini 3.1 Pro (82 percent), according to Artificial Analysis.
- On Artificial Analysis’ Coding Index, a weighted average of coding benchmarks, Muse Spark (47) fell behind GPT-5.4 (57), Gemini 3.1 Pro Preview (56), and Claude Sonnet 4.6 set to max reasoning (51).
- Artificial Analysis independently measured Muse Spark in thinking mode at 39.9 percent on Humanity’s Last Exam, trailing Gemini 3.1 Pro Preview (44.7 percent) and GPT-5.4 (41.6 percent). However, Meta reports 58 percent when Muse Spark used contemplating mode.
- In Meta’s tests, Muse Spark outperformed all models on HealthBench Hard, a subset of OpenAI’s health benchmark, at 42.8 percent, ahead of second-best GPT-5.4 (40.1 percent). Muse Spark also led DeepSearchQA, an agentic browsing evaluation, at 74.8 percent, ahead of Claude Opus 4.6 Max (73.7 percent).
Behind the news: Muse Spark is Meta’s first new model since it reorganized its AI labs after critics alleged that the training data for Llama 4 had been contaminated with benchmark answers. In June 2025, Meta spent $14.3 billion for a 49 percent stake in Scale AI, brought in cofounder Alexandr Wang as chief AI officer, and launched a hiring spree with pay packages worth hundreds of millions of dollars. The proprietary release has raised concerns among developers, many of whom have built projects on open-weights Llama models.
Why it matters: Meta is investing in the capabilities that matter most for its product ambitions: multimodal perception for billions of camera-equipped users, health reasoning for one of the most common categories of AI queries, and multi-agent coordination for multi-step tasks. With a private API preview in progress, it’s positioning itself to compete for business customers alongside OpenAI, Google, and Anthropic. However, its pivot away from being the leading U.S. champion of open weights is a significant loss for the developer community.
We’re thinking: Muse Spark’s contemplating mode and Kimi K2.5’s Agent Swarm point to an emerging pattern: More labs are scaling performance by training models to orchestrate multiple agents at inference time rather than training ever-larger single models.