Cursor Fits Its Model to Its Agent
Composer 2.5 for Cursor rivals GPT-5.5's coding abilities at a lower price

Chart compares performance of Composer 2.5 against Opus 4.7, GPT-5.5, and Composer 2 in benchmarks.

Cursor’s latest software engineering model rivals the performance of leading competitors like Claude Opus 4.7 and GPT-5.5 for a fraction of the price.

What's new: Composer 2.5, the native model for the Cursor agentic software-development environment, improves upon Composer 2, released in March. Like its predecessor, Composer 2.5 is based on Moonshot’s open-weights Kimi K2.5.

  • Input/output: Text in (up to 200,000 tokens), text out; image output available via tool calls
  • Architecture: Mixture-of-experts transformer (1.04 trillion parameters, 32 billion active parameters per token)
  • Features: Function calling, reasoning, context caching
  • Performance: Ranks third on Artificial Analysis Coding Agent Index, first on SWE-Bench-Pro-Hard-AA, second on time per task and cost per task, as measured by Artificial Analysis
  • Availability: Via Cursor’s IDE at $0.50/$0.20/$2.50 per million input/cached/output tokens (fast mode: $3.00/$0.50/$15.00 per million input/cached/output tokens); subscriptions for individuals, teams, and enterprises start at $20 per month
  • Undisclosed: Additional pretraining and fine-tuning data
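
To make the per-token prices above concrete, here is a minimal sketch that estimates the cost of a single request in standard and fast modes. The prices come from the list above; the token counts, the names `Pricing` and `request_cost`, and the assumption that cached tokens are billed separately from uncached input tokens are illustrative only and do not describe Cursor's actual billing logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in US dollars per million tokens."""
    input_per_m: float
    cached_per_m: float
    output_per_m: float


# Prices as listed in the article.
STANDARD = Pricing(input_per_m=0.50, cached_per_m=0.20, output_per_m=2.50)
FAST = Pricing(input_per_m=3.00, cached_per_m=0.50, output_per_m=15.00)


def request_cost(pricing: Pricing, input_tokens: int,
                 cached_tokens: int, output_tokens: int) -> float:
    """Estimate cost in dollars, assuming cached tokens are billed
    at the cached rate and excluded from the uncached input count."""
    total = (input_tokens * pricing.input_per_m
             + cached_tokens * pricing.cached_per_m
             + output_tokens * pricing.output_per_m)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical workload: 100k uncached input, 400k cached, 20k output.
    workload = dict(input_tokens=100_000, cached_tokens=400_000,
                    output_tokens=20_000)
    print(f"standard: ${request_cost(STANDARD, **workload):.2f}")
    print(f"fast:     ${request_cost(FAST, **workload):.2f}")
```

Under these assumed token counts, the fast mode costs several times as much as the standard mode, mostly because of its higher input and output rates.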

How it works: Composer 2.5 is built specifically for agentic coding. Cursor detailed its training recipe for Composer 2 in a paper and followed it for Composer 2.5. The authors took Kimi K2.5’s open weights and conducted further pretraining on a large dataset of code. They used reinforcement learning to fine-tune the resulting model using a simulated agentic harness and tools that matched Cursor CLI, the company’s own coding harness. During reinforcement learning, the model was rewarded not only for success but also for brevity and elegance of its output. The team updated the earlier training process as follows:

  • During reinforcement learning, along with rewards, the team gave the model text feedback. For example, if the model performed an incorrect tool call, they would load text into the context window that suggested better available tool calls and use the correct output to teach the model.
  • The team fine-tuned the model using 25 times as many synthetic tasks as they used for Composer 2. The primary purpose was to train the model on more-difficult tasks. For example, a synthetic task might include an input to delete an application feature; this would be paired with the resulting code, clean-up of any remaining artifacts, and a test to ensure that the modified application works. The team did not disclose the model that generated such tasks.
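
The two ideas described above, a reward that favors brief output and text feedback after a bad tool call, can be illustrated with a toy sketch. Cursor has not published this code; the tool names, the token budget, the weighting, and every function name below are hypothetical. The sketch covers only brevity (not elegance) and only the hint-injection half of the feedback step.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical tool names; the real harness's tools are not specified here.
AVAILABLE_TOOLS = {"read_file", "edit_file", "run_tests"}


@dataclass
class Episode:
    context: List[str] = field(default_factory=list)
    output_tokens: int = 0
    tests_passed: bool = False


def text_feedback(tool_name: str) -> Optional[str]:
    """Return a corrective hint for an unknown tool call, else None."""
    if tool_name in AVAILABLE_TOOLS:
        return None
    options = ", ".join(sorted(AVAILABLE_TOOLS))
    return f"Tool '{tool_name}' is not available. Available tools: {options}."


def record_tool_call(episode: Episode, tool_name: str) -> None:
    """Append a hint to the context window when the tool call is invalid."""
    hint = text_feedback(tool_name)
    if hint is not None:
        episode.context.append(hint)


def reward(episode: Episode, token_budget: int = 4000,
           brevity_weight: float = 0.2) -> float:
    """Success reward plus a brevity bonus that applies only on success.

    The budget and weight are illustrative assumptions.
    """
    if not episode.tests_passed:
        return 0.0
    unused_fraction = max(0.0, 1.0 - episode.output_tokens / token_budget)
    return 1.0 + brevity_weight * unused_fraction
```

Tying the brevity bonus to success is one possible design choice: it keeps a model from earning reward by producing short but incorrect output.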

Performance: Composer 2.5 placed third behind Claude Opus 4.7 and GPT-5.5 on a number of independent coding benchmarks, but pulled ahead on Cursor’s own CursorBench when all tested models used default settings in Cursor CLI. It runs faster and at lower cost than comparable models; typically only DeepSeek V4 Pro costs less.

  • On the Artificial Analysis Coding Agent Index (a composite of SWE-Bench-Pro-Hard-AA, Terminal-Bench v2, and SWE-Atlas-QnA that measures a range of agentic software engineering tasks), Cursor CLI using Composer 2.5 (63) fell behind Claude Code using Claude Opus 4.7 at max reasoning (67) and Codex using GPT-5.5 at xhigh reasoning (65). It outperformed Claude Opus 4.7 and GPT-5.5 set to lower reasoning levels and the same models in Cursor’s CLI.
  • On Artificial Analysis’s time per task (measuring the mean time required to complete a task in the Coding Agent Index), Cursor CLI with Composer 2.5 took 6.7 minutes, while Claude Code with Claude Opus 4.7 set to medium reasoning took 8.8 minutes. Claude Code with Claude Opus 4.7 set to max reasoning took more than twice as long (17.7 minutes). On mean cost per task, Cursor CLI with Composer 2.5 in fast mode cost $0.44, while Claude Code with Claude Opus 4.7 at max reasoning cost $4.14.
  • On Cursor’s own CursorBench benchmark, which aims to better simulate agentic coding’s terse user inputs and harder problems, Composer 2.5 (63.2 percent) was just behind Claude Opus 4.7 (64.8 percent) and GPT-5.5 (64.3 percent) at their highest reasoning settings. It pulled ahead of both models at their default settings (61.6 and 59.2 percent, respectively).
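
For a rough sense of scale, the reported time and cost figures can be turned into ratios. This sketch only divides the numbers quoted in the list above; the variable and function names are illustrative, and the ratios say nothing about task quality.

```python
# Mean figures per task reported by Artificial Analysis, as quoted above.
COMPOSER_MINUTES = 6.7      # Cursor CLI with Composer 2.5
OPUS_MAX_MINUTES = 17.7     # Claude Code with Opus at max reasoning
COMPOSER_FAST_COST = 0.44   # dollars, Composer 2.5 in fast mode
OPUS_MAX_COST = 4.14        # dollars, Opus at max reasoning


def ratio(baseline: float, candidate: float) -> float:
    """How many times larger the baseline figure is than the candidate's."""
    return baseline / candidate


time_ratio = ratio(OPUS_MAX_MINUTES, COMPOSER_MINUTES)
cost_ratio = ratio(OPUS_MAX_COST, COMPOSER_FAST_COST)

print(f"time per task: {time_ratio:.1f}x")  # prints "time per task: 2.6x"
print(f"cost per task: {cost_ratio:.1f}x")  # prints "cost per task: 9.4x"
```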

Behind the news: In April, SpaceX obtained the right to acquire Cursor for $60 billion or pay $10 billion for their work together, as part of a broader partnership deal. Cursor will train its models using SpaceX hardware, and it is training new models from scratch, so it may not rely on Moonshot’s open-weights offerings for much longer.

Why it matters: It has become common to say that harness engineering – creating the software and tooling that allows models to perform agentic tasks – is becoming as important as the underlying models themselves. By developing Composer as a specialized software engineering model, Cursor rejects that dichotomy. Fine-tuning models within the harness gives users the best of both worlds: The strengths of the model and the surrounding software are built to work together.

We’re thinking: New software-engineering models appear less frequently than they once did, as Anthropic and OpenAI have captured the market with versatile generalist models operating inside their popular Claude Code and Codex coding tools. Indeed, Cursor began as an integrated development environment (IDE) and sold access to those models long before it developed its own. But Cursor doesn’t need a generalist model; it needs one that can solve developers’ problems at high speed and low cost. Its continued success shows that there’s still a place for specialist models trained for specific tasks, even (or especially) in the agentic era.
