In today’s edition of Data Points, you’ll learn more about:
- Claude Fable 5 no longer silently degrades
- Hermes Agent maker streamlines setup
- Agents’ Last Exam pushes top models
- Gemini-SQL2 translates database queries
But first:
Google turns to diffusion text generation for efficiency and speed
Google introduced DiffusionGemma, an experimental 26B Mixture of Experts model that abandons the sequential token-by-token generation of standard language models in favor of diffusion-based text generation. Instead of predicting words one at a time, DiffusionGemma generates entire 256-token blocks simultaneously, achieving over 1,000 tokens per second on an NVIDIA H100 GPU, up to four times faster than autoregressive models. The approach works by starting with random placeholder tokens and iteratively refining them across multiple passes, similar to how diffusion models generate images from noise. The 26B model activates only 3.8B parameters during inference, fitting within 18GB of VRAM on consumer GPUs when quantized, making it practical for local deployment. Google openly acknowledges the trade-off: Output quality is lower than standard Gemma 4, and the speed advantage only applies to single-user, low-concurrency inference; cloud deployments running many requests simultaneously get no benefit and may cost more. The model opens up new use cases like in-line code editing and tasks requiring bidirectional attention, such as Sudoku solving, where each token depends on future context. (Google)
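As a rough illustration of the iterative refinement described above, the toy Python sketch below fills a block of placeholder tokens over several passes. It uses confidence-based unmasking, one common scheme for discrete text diffusion; the source does not say DiffusionGemma works this way. The names `toy_denoiser` and `diffusion_decode` are hypothetical, and the denoiser cheats by reading a known target sequence where a real model would predict tokens.

```python
import random

MASK = "<mask>"  # placeholder token used before a position is filled


def toy_denoiser(block, target, rng):
    """Stand-in for a model (assumption): for every masked position,
    propose a token and a made-up confidence score."""
    proposals = {}
    for i, tok in enumerate(block):
        if tok == MASK:
            proposals[i] = (target[i], rng.random())
    return proposals


def diffusion_decode(target, steps=4, seed=0):
    """Fill a whole block in `steps` passes instead of one token at a time."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    history = [list(block)]  # snapshot of the block after each pass
    for step in range(steps):
        proposals = toy_denoiser(block, target, rng)
        if not proposals:
            break
        remaining_steps = steps - step
        # Commit enough positions per pass to finish on schedule (ceiling division).
        k = -(-len(proposals) // remaining_steps)
        most_confident = sorted(
            proposals, key=lambda i: proposals[i][1], reverse=True
        )[:k]
        for i in most_confident:
            block[i] = proposals[i][0]
        history.append(list(block))
    return block, history


if __name__ == "__main__":
    words = "the quick brown fox jumps over lazy dogs".split()
    final, passes = diffusion_decode(words, steps=4)
    for snapshot in passes:
        print(" ".join(snapshot))
```

Each pass proposes tokens for every open position at once, which is why a block can be finished in a handful of passes instead of one pass per token.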
Anthropic shuts down access to Claude Fable and Mythos worldwide
The US government issued an export control directive ordering Anthropic to suspend access to Claude Fable 5 and Claude Mythos 5 for all foreign nationals — whether inside or outside the United States — citing national security concerns about a potential jailbreak method. Because Anthropic could not selectively filter foreign nationals from domestic users in real time, the company chose to disable both models for all customers to ensure compliance. The government demonstrated a narrow bypass technique that Anthropic characterizes as asking the model to review code for flaws, a capability the company says is widely available in other models like OpenAI’s GPT-5.5 and used routinely by security defenders. Anthropic argues the jailbreak is non-universal and minor compared to the safeguards it instituted through thousands of hours of red-teaming with government and third-party organizations, and contends that applying this standard industry-wide would effectively halt new model deployments. The company disputes the rationale but has removed access to Fable and Mythos for all users while it works to come to an agreement with the government. (Anthropic)
Anthropic removes silent safeguards on Claude Fable 5
Shortly before it removed access to the model, Anthropic backed away from a controversial safeguard in Claude Fable 5 that would have covertly degraded the model’s performance for researchers attempting to use it for competitive AI development. The company had planned to silently limit capabilities without alerting users they’d triggered the restriction, a move it argued would slow frontier AI development and prevent adversaries from accessing cutting-edge tools. Researchers and AI safety advocates quickly condemned the approach as counterproductive, arguing it would concentrate advanced research in a handful of labs and prevent independent evaluation of AI safety. Anthropic now says it will make these safeguards visible, alerting users when they’ve requested something forbidden or been routed to a less capable model, though this means more requests may trigger alerts than before. The reversal reveals a core tension: the company’s stated commitment to distributed AI safety research conflicts with its business interest in controlling access to its most capable models. (Wired)
Hermes Agent maker streamlines setup with simpler workflow
Nous Research shipped a Profile Builder dashboard for Hermes Agent, replacing multiple CLI commands with a single guided browser flow. The builder lets you define an agent’s identity, select a model and provider, toggle built-in skills, install skills from a hub, and attach MCP servers, all from one interface. Each profile becomes an isolated agent with its own config.yaml, .env, and state database, so a coding agent and a research agent never collide. The dashboard runs on localhost by default and writes its output to standard profile files, so the CLI remains available for scripting. (MarkTechPost)
GPT-5.5 edges Claude Fable 5 on challenging agentic benchmark
UC Berkeley’s Center for Responsible, Decentralized Intelligence launched Agents’ Last Exam (ALE), a benchmark built around more than 1,500 authentic professional workflows drawn with reference to the U.S. federal O*NET/SOC occupational taxonomy, covering 55 industry subfields grouped into 13 industry clusters. The tasks are genuine: create a 3D model in Siemens NX, analyze neuroimaging data in FSLeyes, compose visual effects in After Effects. OpenAI’s GPT-5.5 running through the Codex harness topped the leaderboard with a 24.0 percent pass rate; GPT-5.5 also claimed second place (23.0 percent via ALE Claw), with Anthropic’s freshly released Claude Fable 5 coming in third at 22.0 percent. But winning with a score that low is its own kind of indictment. On the hardest “Last-Exam” tier, most configurations, including Google’s Gemini CLI and Anthropic’s Claude Opus 4.8, scored 0.0 percent. ALE addresses two chronic problems with AI evaluation: grading integrity and benchmark contamination. It uses LLM-as-a-judge methods for only 6.8 percent of tasks and otherwise relies on deterministic code-based comparison against expert ground-truth artifacts. It also keeps roughly 90 percent of its task set private, rotating questions in and out to prevent memorization from inflating scores. (VentureBeat)
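To make "deterministic code-based comparison" concrete, here is a minimal Python sketch of how a grader might check a produced artifact against expert ground truth and turn the results into a pass rate. It is an illustration only, not ALE's grader: the function names, the dictionary-shaped artifact, and fields such as `volume_mm3` are all hypothetical.

```python
import math


def grade_artifact(produced, expected, rel_tol=1e-6):
    """Compare a produced artifact to ground truth with fixed rules:
    floats must match within a tolerance, everything else exactly.
    The same inputs always give the same verdict (no LLM judge)."""
    if produced.keys() != expected.keys():
        return False
    for key, want in expected.items():
        got = produced[key]
        if isinstance(want, float):
            if not isinstance(got, (int, float)):
                return False
            if not math.isclose(got, want, rel_tol=rel_tol):
                return False
        elif got != want:
            return False
    return True


def pass_rate(results):
    """Percentage of tasks whose artifact passed the check."""
    return 100.0 * sum(results) / len(results)


# Hypothetical ground truth for a 3D-modeling task.
expected = {"volume_mm3": 1250.0, "face_count": 6, "units": "mm"}
good = {"volume_mm3": 1250.0000001, "face_count": 6, "units": "mm"}
bad = {"volume_mm3": 1250.0, "face_count": 5, "units": "mm"}

results = [grade_artifact(good, expected), grade_artifact(bad, expected)]
print(results, pass_rate(results))
```

Because the verdict depends only on the artifact and the stored ground truth, reruns cannot drift the way a model-based judge might.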
New Gemini tool tops other models at database query generation
Google announced Gemini-SQL2, a text-to-SQL capability built on Gemini 3.1 Pro that converts natural language questions into executable SQL queries. The system achieved 80.04 percent execution accuracy on the BIRD leaderboard’s single-model track. Execution accuracy measures whether generated SQL actually runs and returns correct results, not just whether it looks syntactically valid. The score marks an improvement of about 2.84 points over Google’s previous Gemini-SQL system (which scored roughly 77.2 percent on the same benchmark) and puts the company ahead of competitors including AWS, Databricks, and Anthropic. Google has not yet published an API or model card, or confirmed which products will integrate Gemini-SQL2, though the announcement hints at potential deployment in BigQuery Studio, AlloyDB AI, and Cloud SQL Studio. The 12.92-point gap between Gemini-SQL2 and human performance (92.96 percent) highlights the remaining difficulty: handling data messiness, complex business logic, and schema ambiguity, challenges the BIRD benchmark explicitly tests across 12,751 question-SQL pairs spanning 95 real-world databases. (MarkTechPost)
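The idea behind execution accuracy can be sketched in a few lines of Python using the standard `sqlite3` module: run the predicted query and the reference query against the same database and count a hit only when the predicted query executes and returns the same rows. This is a simplified sketch, not BIRD's official scorer; the `employees` table, the sample queries, and the order-insensitive row comparison are assumptions made for illustration.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory database (hypothetical schema and data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE employees (name TEXT, dept TEXT, salary INTEGER)")
    conn.executemany(
        "INSERT INTO employees VALUES (?, ?, ?)",
        [("Ana", "eng", 120), ("Bo", "eng", 100), ("Cy", "ops", 90)],
    )
    return conn


def run_query(conn, sql):
    """Return result rows in a canonical order, or None if the SQL fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return sorted(rows, key=repr)  # assumption: row order does not matter


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs with matching results."""
    correct = 0
    for predicted_sql, gold_sql in pairs:
        predicted = run_query(conn, predicted_sql)
        gold = run_query(conn, gold_sql)
        if predicted is not None and predicted == gold:
            correct += 1
    return correct / len(pairs)


pairs = [
    # Same rows, different ordering clause: counts as correct.
    ("SELECT name FROM employees WHERE salary > 100",
     "SELECT name FROM employees WHERE salary > 100 ORDER BY name"),
    # Looks plausible but references a table that does not exist: fails to run.
    ("SELECT name FROM employee", "SELECT name FROM employees"),
    # Runs, but an off-by-one condition returns an extra row: incorrect.
    ("SELECT name FROM employees WHERE salary >= 100",
     "SELECT name FROM employees WHERE salary > 100"),
    # Different SQL text, identical result: counts as correct.
    ("SELECT COUNT(*) FROM employees WHERE dept = 'eng'",
     "SELECT COUNT(name) FROM employees WHERE dept = 'eng'"),
]

score = execution_accuracy(build_demo_db(), pairs)
print(f"execution accuracy: {score:.2%}")
```

The second and third pairs show why the metric is stricter than a syntax check: one query never runs, and the other runs cleanly but answers a slightly different question.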
Want to know more about what matters in AI right now?
Read the latest issue of The Batch for in-depth analysis of news and research.
Last week, Andrew talked about experimenting with AI desktop agents for task automation, the development of an open-source project called OpenCoworker, and the importance of privacy in AI tools.
“The software that wraps around the LLM to implement a desired agentic system is called the agent harness, and it enables the LLM to drive the key loop that decides what to do next at each step. So far, most practical Agentic AI workflows (except for coding agents) have not relied on the LLM to this extent to decide what to do next. Instead, they have relied more on developer-specified workflows to deliver higher reliability. But in the past few months, frontier LLMs have advanced sufficiently for this style of harness design to provide an important, if still not entirely reliable, alternative.”
Read Andrew’s letter here.
Other top AI news and research covered in depth:
- Anthropic released Claude Mythos 5 and Claude Fable 5, a public version with safeguards, to enhance AI accessibility while prioritizing safety.
- Composer 2.5 for Cursor rivals GPT-5.5’s coding abilities at a lower price, offering a competitive alternative for developers.
- What is recursive self-improvement, and why is everybody talking about it? Discover the potential implications for AI evolution.
- Significant portions of AI training material reflect national propaganda, raising concerns about the influence of state media on LLM responses.
A special offer for our community
In case you missed it, DeepLearning.AI launched our first-ever subscription plan for our entire course catalog! As a Pro Member, you’ll immediately enjoy access to:
- Nearly 200 AI short and long courses from Andrew Ng and industry experts
- Labs and quizzes to test your knowledge
- Projects to share with employers
- Certificates to testify to your new skills
- A community to help you advance at the speed of AI
Enroll now to lock in a year of full access for $25 per month paid upfront, or opt for month-to-month payments at just $30 per month. Both payment options begin with a one-week free trial. Explore Pro’s benefits and start building today!
Data Points is produced by human editors with AI assistance.