Machine Learning Research

607 Posts

Robot arm successfully places pot on cloth, demonstrating reward verification and scoring effectiveness.

Better Reward Models for Robots: Inside RoboReward, a family of vision-language reward models that train robots to take action

When you’re training a robot via reinforcement learning, a handcrafted reward function is labor-intensive to build but often dispenses rewards more effectively than a general-purpose reward model based on a vision-language model. Researchers built reward models that narrowed the gap.
The table shows MAI-Thinking-1 leading in several benchmarks, compared to other AI models.

Microsoft Strikes Out on Its Own: Microsoft revealed MAI-Thinking-1, a Claude Sonnet 4.6-sized reasoning model developed without distillation

Microsoft, once OpenAI’s exclusive partner and still a major reseller of other companies’ AI models, built its own reasoning model from scratch.
Six charts show Fugu and Fugu Ultra scoring highest, marked by red bars, on various tasks and benchmarks.

Fugu Blends Models Task by Task: Sakana debuted dedicated orchestrator models, Fugu and Fugu-Ultra, that spawn Claude, Gemini, and GPT agents

Models that orchestrate the activities of other models and agents achieved state-of-the-art performance on a variety of benchmarks, outperforming the best individual models working alone.
Detailed eagle talons grip a metallic logo amid a clear blue background, symbolizing control and power.

GPT-5.6 Lands in Limbo: OpenAI previewed three GPT-5.6 models (Sol, Terra, and Luna), with wider release coming soon

OpenAI announced a preview of its GPT-5.6 family, including a top-tier model comparable to Claude 5 Mythos, but so far it’s available only to users selected by the U.S. government.

Anthropic Opus 4.8 Leaps Forward: Claude Opus 4.8 won back the high-performance crown for Anthropic, pending wider availability of its Mythos-class models

Anthropic’s mid-2026 update of Opus held the top of a leading intelligence ranking for about a week, only to be overtaken by Claude Fable 5.
Flowchart of an ESMC-6B model with sequence encoding layers, language model, and diffusion transformer output.

Biological Molecules as Language: ESMFold2 approaches AlphaFold 3 performance but with an open, Transformer-based architecture

Google’s AlphaFold models pioneered the task of finding the shapes of biologically active molecules, opening new pathways for drug development.
AFM 3 Core model architecture visualizes DRAM and NAND processes in AI with focus on sparsely-activated LLM operations.

Large-Model AI for Apple Devices: 2026's Apple Foundation Models bring AI to MacBooks, iPhones, and the cloud

The third generation of Apple Foundation Models, the fruit of Apple’s collaboration with Google, introduces a variation on the mixture-of-experts architecture that runs on local devices.
Chart compares GLM and GPT models on reasoning and coding benchmarks.

Top Agentic Performance, Low Cost: GLM-5.2, designed for coding and long-running agentic jobs, now the top open model

Z.ai released an open-weights model that rivals proprietary leaders for autonomous agentic tasks.
Flowchart illustrates the POPE method, transitioning from guided to unguided problem-solving in reinforcement learning.

Reinforcement Learning With Hints: Privileged On-Policy Exploration (POPE) trains models to expand on partial solutions

Reinforcement learning can’t train a model to solve a difficult problem if the model doesn’t discover all the right steps.
Performance table shows Nemotron's scores across benchmarks, highlighting its strengths and weaknesses.

Nvidia’s Nemotron Goes Big: Nvidia Nemotron 3 Ultra bets on speed and openness to win customers

Nvidia’s largest-yet model is among the best-performing from a developer based in the U.S. and among the most open developed by anyone.
A line graph compares SWE-Bench Pro and DeepSWE, showing various models' performance percentages.

Agentic Tests Beyond the Bug Hunt: DeepSWE, ProgramBench, and ITBench-AA push agents harder than SWE-bench

SWE-bench, a family of benchmarks that focuses on an LLM’s ability to fix software bugs, is giving way to new tests that evaluate agent software-engineering performance in more challenging ways.
Bar chart shows Claude Fable 5's fallback rates, with ProgramBench at 100% and others varying.

Claude Fable 5’s Benchmark Problems: Independent tests of Claude Fable 5 run into Anthropic's protective policies

Before Anthropic pulled its latest Claude models from circulation, even professional testers couldn’t readily tell whether they were getting a Mythos-class model or a lesser version under the same name.
Diagram illustrates LLMs processing state-coordinated media, affecting linguistic responses and predictions.

State Media Influences LLM Responses: Significant portions of AI training material reflect national propaganda

Popular large language models have adopted the biases of governments that control the free flow of information, particularly when those models generate output in the languages of countries where such governments are in power, researchers found.
Bar chart shows a sharp rise in code output per person after Claude Code's release, reaching 8x by 2026.

RSI Is the New AGI: What is recursive self-improvement, and why is everybody talking about it?

The phrase recursive self-improvement erupted on social media following an Anthropic report that tracked AI-driven gains in the company’s internal software-engineering productivity.
Chart compares performance of Composer 2.5 against Opus 4.7, GPT-5.5, and Composer 2 in benchmarks.

Cursor Fits Its Model to Its Agent: Composer 2.5 for Cursor rivals GPT-5.5's coding abilities at lower price

Cursor’s latest software-engineering model rivals the performance of leading competitors like Claude Opus 4.7 and GPT-5.5 at a fraction of the price.
