Jun 19, 2026

6 Posts

Flowchart illustrates the POPE method, transitioning from guided to unguided problem-solving in reinforcement learning.

Jun 19, 2026

Reinforcement Learning With Hints: Privileged On-Policy Exploration (POPE) trains models to expand on partial solutions

Reinforcement learning can’t train a model to solve a difficult problem if the model doesn’t discover all the right steps.

Performance table shows Nemotron's scores across benchmarks, highlighting its strengths and weaknesses.

Jun 19, 2026

Nvidia’s Nemotron Goes Big: Nvidia Nemotron 3 Ultra bets on speed and openness to win customers

Nvidia’s largest model yet is among the best-performing from any U.S.-based developer and among the most open from any developer.

A line graph compares SWE-Bench Pro and DeepSWE, showing various models' performance percentages.

Jun 19, 2026

Agentic Tests Beyond the Bug Hunt: DeepSWE, ProgramBench, and ITBench-AA push agents harder than SWE-bench

SWE-bench, a family of benchmarks that focuses on an LLM’s ability to fix software bugs, is giving way to new tests that evaluate agents’ software-engineering performance in more challenging ways.

Bar chart shows Claude Fable 5's fallback rates, with ProgramBench at 100% and others varying.

Jun 19, 2026

Claude’s Benchmark Problems: Independent tests of Claude Fable 5 run into Anthropic's protective policies

Before Anthropic pulled its latest Claude models from circulation, even professional testers couldn’t readily tell whether they were getting a Mythos-class model or a lesser version under the same name.

Cartoon map illustrating global collaboration in research, open source technology, infrastructure, secure data sharing.

Jun 19, 2026

Open Platforms Beat Power Plays

Over the last two weeks, both the U.S. Government and Anthropic took significant actions that demonstrated their power to control access to AI by restricting what others can do with frontier models.

Jun 19, 2026

Testing Mythos and Fable, Moving Beyond SWE-bench, Nvidia's Open Contender

The Batch AI News and Insights: Over the last two weeks, both the U.S. Government and Anthropic took significant actions that demonstrated their power to control access to AI by restricting what others can do with frontier models.

