Agentic Tests Beyond the Bug Hunt

DeepSWE, ProgramBench, and ITBench-AA push agents harder than SWE-bench

A line graph compares SWE-Bench Pro and DeepSWE, showing various models' performance percentages.

SWE-bench, a family of benchmarks that focuses on an LLM’s ability to fix software bugs, is giving way to new tests that evaluate agent software-engineering performance in more challenging ways.

What’s new: Three recently released benchmarks are strong contenders to replace the SWE-bench family (SWE-bench, SWE-Bench Pro, SWE-bench Multilingual, and SWE-bench Verified).

  • DeepSWE measures agents’ feature-implementation capabilities by posing problems that are more challenging to diagnose and require more code to solve.
  • ProgramBench measures how well agents can develop new programs from ideas entered as prompts.
  • ITBench-AA tests an agent’s ability to diagnose the root causes of incidents in modern software systems.

DeepSWE: Developed by Datacurve, DeepSWE is closest in intent to SWE-bench, which has been forked multiple times since LLMs started routinely acing it. DeepSWE presents examples that have been vetted by human experts, and it draws them from private code bases to reduce the risk that they have leaked into models’ training datasets. It consists of 113 problems in 5 programming languages. Independent benchmark firm Artificial Analysis recently replaced SWE-Bench Pro with DeepSWE in its Intelligence and Coding Agent indices.

  • Given a brief prompt (in contrast to the detailed prompts in SWE-Bench Pro, a revision of SWE-bench), the agentic harness mini-swe-agent must use an LLM to devise a solution from many acceptable possibilities. The solutions require around 5.5 times more lines of code than solutions to SWE-Bench Pro problems.
  • Unlike SWE-bench, DeepSWE uses human-written problems and tests to verify potential solutions. The problems are based on real repositories, but not taken from existing or solved code. For example, one task is “Extend indexing ranges so arrays and strings support a third slice component: value[start:end:step]” in the ABS programming language GitHub repository.
  • Currently, GPT-5.5 set to xhigh reasoning leads DeepSWE, solving 70 percent of the problems. Next-best is Claude Opus 4.8, which solved 58 percent. Gemini 3 Flash achieved 5 percent, a gap of 65 percentage points behind the leader.

ProgramBench: Developed by researchers at Meta, Stanford, and Harvard, ProgramBench tests how well a model controlled by the SWE-agent harness turns 200 ideas into functional programs without human oversight. The agent, which has access to a console that can execute an existing program, must write a new program that reproduces the existing program’s outputs for given inputs.

  • The authors built the benchmark using an agent (either mini-SWE-agent or SWE-agent) with Claude Sonnet 4.5 to (i) identify a candidate repository, (ii) build an executable program, (iii) generate tests that show what happens when the program processes various inputs, and (iv) build a testing environment that includes a compiled executable program, documentation of how to use the executable, and test assets the model might not be able to generate, like images.
  • The programs to be replicated range from easy to hard. For instance, one called entr simply runs a command when a file changes. A more complicated program called ffmpeg encodes, decodes, and otherwise processes audio and video.
  • So far, no model has been able to create programs that pass all the tests. Lowering the bar to passing at least 95 percent of the tests, Claude Opus 4.7 reproduced 3 percent of the programs, Claude Opus 4.6 reproduced 2.5 percent, and Claude Sonnet 4.6 reproduced 1.6 percent. At the time of publication, no other model had reproduced any of the programs.
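
The 95 percent bar described above can be illustrated with a short sketch. It assumes (this is not confirmed by the benchmark's authors) that a program counts as reproduced when the share of its tests that pass meets a threshold. The function names and the sample results are hypothetical and are not ProgramBench data.

```python
from typing import Dict, List


def pass_rate(test_results: List[bool]) -> float:
    """Fraction of tests passed for a single program (0.0 if there are no tests)."""
    return sum(test_results) / len(test_results) if test_results else 0.0


def reproduced_share(per_program: Dict[str, List[bool]], threshold: float) -> float:
    """Share of programs whose test pass rate meets or exceeds the threshold."""
    if not per_program:
        return 0.0
    reproduced = sum(
        1 for results in per_program.values() if pass_rate(results) >= threshold
    )
    return reproduced / len(per_program)


# Hypothetical per-program test outcomes (True = test passed).
sample_results = {
    "prog_a": [True] * 20,                 # passes all tests
    "prog_b": [True] * 19 + [False],       # passes 95 percent of tests
    "prog_c": [True] * 10 + [False] * 10,  # passes half of the tests
    "prog_d": [False] * 20,                # passes none
}

strict_share = reproduced_share(sample_results, threshold=1.0)
relaxed_share = reproduced_share(sample_results, threshold=0.95)
print(f"all tests: {strict_share:.2f}, at least 95 percent: {relaxed_share:.2f}")
```

In this made-up sample, relaxing the threshold from 100 percent to 95 percent raises the share of reproduced programs, which mirrors how the reported scores move from zero to a few percent.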

ITBench-AA: Developed by IBM and the independent testing lab Artificial Analysis, ITBench-AA updates IBM’s earlier ITBench. It tests the ability of a model controlled by Artificial Analysis’ Stirrup harness to diagnose the technical conditions that lead software systems to make an error, such as running out of memory or changing a configuration file incorrectly.

  • ITBench-AA consists of 59 human-written incidents based on real-world events. Each incident includes alerts, events, error traces, system metrics, manifests of all the applications involved, as well as a ground-truth diagnosis of the root cause. For example, in one incident, a program faced seven different alerts that mentioned high error rates. The diagnosis was human error (servers were taken offline for maintenance).
  • ITBench-AA measures full recall, the share of incidents for which a model identifies every root cause in the ground-truth diagnosis; if a model misses any root cause, it scores zero for that incident.
  • Among models tested so far, Claude Opus 4.7 set to max reasoning achieved 46.7 percent, the highest full recall. GPT-5.5 set to xhigh reasoning achieved 45.8 percent. At the bottom of the list, Llama 3.3 70B achieved 0.6 percent, a spread of about 46 percentage points.
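
The all-or-nothing scoring described above can be sketched as follows. This is a minimal illustration under stated assumptions: root causes are compared as exact string labels, and extra predicted causes are not penalized. The article does not specify how ITBench-AA matches diagnoses to ground truth, so the function names, labels, and matching rule here are hypothetical.

```python
from typing import List, Set


def incident_score(predicted: Set[str], truth: Set[str]) -> int:
    """Return 1 only if every ground-truth root cause was predicted, else 0."""
    return 1 if truth and truth <= predicted else 0


def full_recall(predictions: List[Set[str]], truths: List[Set[str]]) -> float:
    """Share of incidents in which all ground-truth root causes were identified."""
    if len(predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        return 0.0
    scores = [incident_score(p, t) for p, t in zip(predictions, truths)]
    return sum(scores) / len(truths)


# Hypothetical incidents: labels are illustrative, not taken from the benchmark.
ground_truth = [
    {"out_of_memory"},
    {"bad_config_change", "node_offline"},
    {"maintenance_downtime"},
]
model_output = [
    {"out_of_memory"},                      # all causes found: scores 1
    {"bad_config_change"},                  # one cause missed: scores 0
    {"maintenance_downtime", "dns_error"},  # all causes found (extra ignored): scores 1
]

score = full_recall(model_output, ground_truth)
print(f"full recall: {score:.3f}")
```

Under these assumptions, a model that finds one of two root causes in an incident gets no partial credit, which helps explain why even the leading scores sit below 50 percent.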

Why it matters: For years, the best measurements of a model’s general agentic capabilities were SWE-bench and its variants. They were designed primarily to measure the ability of models, and later agents, to fix bugs and solve other basic software engineering problems. Over time, the models became capable enough to achieve nearly 100 percent (possibly because the benchmark problems found their way into the models’ training data). Meanwhile, agents took on more difficult tasks, running longer with less-specific and less-consistent human instructions. DeepSWE, ProgramBench, and ITBench-AA, despite their different approaches, all pose problems that add both complexity and specificity and are unlikely to be in models’ training sets.

We're thinking: It’s heartening to see how far agents have come, and humbling to know how much room they still have to improve.

