AI Agents

79 Posts

Performance table shows Nemotron's scores across benchmarks, highlighting its strengths and weaknesses.

Nvidia’s Nemotron Goes Big: Nvidia Nemotron 3 Ultra bets on speed and openness to win customers

Nvidia’s largest-yet model is among the best-performing from a developer based in the U.S. and among the most open developed by anyone.

A line graph compares SWE-Bench Pro and DeepSWE, showing various models' performance percentages.

AI Agents

Agentic Tests Beyond the Bug Hunt: DeepSWE, ProgramBench, and ITBench-AA push agents harder than SWE-bench

SWE-bench, a family of benchmarks that focuses on an LLM’s ability to fix software bugs, is giving way to new tests that evaluate agent software-engineering performance in more challenging ways.

Bar chart shows a sharp rise in code output per person after Claude Code's release, reaching 8x by 2026.

AI Agents

RSI Is the New AGI: What is recursive self-improvement, and why is everybody talking about it?

The phrase recursive self-improvement erupted on social media following an Anthropic report that tracked AI-driven gains in the company’s internal software-engineering productivity.

Chart compares performance of Composer 2.5 against Opus 4.7, GPT-5.5, and Composer 2 in benchmarks.

AI Agents

Cursor Fits Its Model to Its Agent: Composer 2.5 for Cursor rivals GPT-5.5's coding abilities at lower price

Cursor’s latest software engineering model rivals the performance of leading competitors like Claude Opus 4.7 and GPT-5.5 for a fraction of the price.

Flowchart depicting LLMs memorizing and responding to state media, affecting language-specific outputs.

AI Agents

Qwen3.7-Max Adds Speed and Power: Alibaba's latest proprietary model challenges U.S. rivals

Alibaba updated its flagship large language model for long-running agentic work, pushing it into the top rank among LLMs built in China.

Doughnut chart shows 77% of agentic traffic in 2025 went to product search pages.

AI Agents

Agents Surf the AI-Written Web: Internet traffic driven by AI tripled last year, study shows

AI-driven activity on the internet rose sharply last year, a study shows.

Gemini 3.5 Flash shows improved performance, surpassing previous model scores in most benchmarks.

AI Agents

Gemini 3.5 Flash Pairs Smarts With Speed: Google's updated Flash levels up, approaching top models but raising prices

Google’s faster model brings substantive gains at a substantially higher price, part of a rising trend in prices per token.

The chart compares AI benchmark efforts with employment and capital in U.S. job sectors, highlighting discrepancies.

AI Agents

Toward Agent Benchmarks That Reflect Human Work: AI agents may not be getting better at the full range of economically valuable labor

AI agents seem to be increasingly capable of performing economically valuable tasks, but current benchmarks measure this capability only narrowly.

A woman in martial arts attire faces off against a cartoon lobster in a futuristic cityscape.

AI Agents

Hermes Agent Challenges OpenClaw: OpenClaw created a class of personal agents; upstart Hermes Agent is outworking it

OpenClaw, the immensely popular AI agent, has fast-rising competition.

Vibrant dragon with brush poised, signifies China's decisive action in blocking the tech acquisition.

AI Agents

China Nixes Meta-Manus Tie-Up: State regulators block acquisition of an agentic startup headquartered in Singapore

China shut down Meta’s attempt to acquire agentic technology that originated within its borders, a blow to further technical interchange and investment between China and the U.S.

Graph depicts GPT-Realtime-2's performance across sectors, competing with other speech-to-speech models.

AI Agents

OpenAI Challenges Speech-to-Speech Leaders: Realtime API updates audio models that reason, transcribe, and translate

An update of OpenAI’s speech-to-speech model lets developers tune the tradeoff between speed and reasoning.

Graphs compare human and LLM performance strategies in rock-paper-scissors, highlighted by stars.

AI Agents

Strategic Thinking in LLMs vs. Humans: Researchers at UT-Austin and Google model human decision-making in Rock-Paper-Scissors

While large language models can behave in human-like ways, the similarities are superficial. A simple strategy game revealed clear differences in their strategic approaches.

Table highlights Kimi K2.6's dominance in agentic tasks with 86.3 and coding at 58.6, surpassing other models.

AI Agents

Kimi K2.6 Challenges Open-Weights Champs: Kimi K2.6 matches open Qwen3.6 Max and DeepSeek V4, falls just behind top closed models

Moonshot AI’s updated Kimi model handles longer autonomous coding sessions and scales up its multi-agent orchestration relative to its predecessor.

GLM-5.1 excels in SWE-Bench Pro and Terminal-Bench 2.0, leading in coding and reasoning tests.

AI Agents

GLM-5.1 Aims for Long-Running Tasks: Z.ai’s GLM-5.1 evaluates interim results and may change its approach hundreds of times before it delivers final output

Z.ai updated its flagship open-weights large language model to work autonomously on single tasks for up to eight hours.

A black box with a red symbol is open, revealing a glowing interior, symbolizing a security breach.

AI Agents

Inside Claude Code: Claude Code’s source code leaked, exposing potential future features Kairos and autoDream

The inner workings of the popular coding agent Claude Code are available for all to see.

Subscribe to The Batch