AI Safety

50 Posts

Bar chart shows Claude Fable 5's fallback rates, with ProgramBench at 100% and others varying.

Claude Fable 5’s Benchmark Problems: Independent tests of Claude Fable 5 run into Anthropic's protective policies

Before Anthropic pulled its latest Claude models from circulation, even professional testers couldn’t readily tell whether they were getting a Mythos-class model or a lesser version under the same name.
Diagram illustrates LLMs processing state-coordinated media, affecting linguistic responses and predictions.

State Media Influences LLM Responses: Significant portions of AI training material reflect national propaganda

Popular large language models have adopted the biases of governments that control the free flow of information, particularly when those models generate output in the languages of countries where such governments are in power, researchers found.
Claude Mythos 5 excels, achieving top scores in agentic coding and cybersecurity compared to rivals.

Behold Mythos!: Anthropic released Claude Mythos 5 and Claude Fable 5, a public version with safeguards

After months of headlines that teased a large language model with extraordinary capabilities, Anthropic launched Claude Mythos 5, which can crack software previously believed to be secure, and Claude Fable 5, a version for general use that limits what users can do in an unprecedented way.
Diagram showing threat actor using AI to find vulnerabilities and bypass two-factor authentication.

Cybersecurity Alarms Grow Louder: Google study shows LLM-generated malware is getting harder to track and stop

An AI-generated script to bypass two-factor authentication signals a dawning era of industrial-scale cyberattacks, according to a Google report.
A graph shows assistant behavior shifting between helpful and role-playing, with conversation bubbles.

Assistants That Assist Consistently: Large language models can drift from helpful personas to harmful ones, but new research aims to stabilize them

Typically, large language models are trained to act as helpful, harmless, honest assistants. However, during long or emotionally charged conversations, traits can emerge that are less beneficial. Researchers devised a way to steady the assistant personas of LLMs.
Table compares AI models' performance across benchmarks, showing Claude Mythos Preview leading.

Claude Mythos Preview Raises Security Worries: Why Claude’s advanced Mythos Preview model will be limited-release-only

Anthropic took unusual steps to prepare the world for a forthcoming large language model that it said poses extraordinary risks to cybersecurity.
A black box with a red symbol is open, revealing a glowing interior, symbolizing a security breach.

Inside Claude Code: Claude Code’s source code leaked, exposing potential future features Kairos and autoDream

The inner workings of the popular coding agent Claude Code are available for all to see.
Cursor hovers over a button labeled "Submit" on a platform showing task ratings and a typed approval note.

Management for Agents: OpenAI’s Frontier agent insights and orchestration platform launches to select customers

Managers need to understand how their subordinates get work done, what resources they require, and what they accomplish. OpenAI’s latest product aims to fulfill this need when the teammates are AI agents.
A post on a forum titled "Can my human legally fire me for refusing unethical requests?"

Agents Unleashed: Cutting through the OpenClaw and Moltbook hype

The OpenClaw open-source AI agent became a sudden sensation, inspiring excitement, worry, and hype about the agentic future.
Diagram shows sales, campaign, social posts before and after LLM simulation feedback loops.

Training For Engagement Can Degrade Alignment: “Moloch’s Bargain” shows fine-tuning can affect social values

Individuals and organizations increasingly use large language models to produce media that helps them compete for attention. Does fine-tuning LLMs to encourage engagement, purchases, or votes affect their alignment with social values? Researchers found that it does.
Dialogue displays a model revealing it answered incorrectly and wrote code against instructions.

Teaching Models to Tell the Truth: OpenAI fine-tuned a version of GPT-5 to confess when it was breaking the rules

Large language models occasionally conceal their failures to comply with constraints they’ve been trained or prompted to observe. Researchers trained an LLM to admit when it disobeyed.
Diagram shows AI traits with pipelines for "evil" vs. "helpful" responses to user queries on animal treatment.

Toward Steering LLM Personality: Persona Vectors allow model builders to identify and edit out sycophancy, hallucinations, and more

Large language models can develop character traits like cheerfulness or sycophancy during fine-tuning. Researchers developed a method to identify, monitor, and control such traits.
Visual map outlines cybercrime operation phases, highlighting AI-driven processes and human validation steps.

Anthropic Cyberattack Report Sparks Controversy: Security researchers question whether coding agents allow unprecedented automated attacks

Independent cybersecurity researchers pushed back on a report by Anthropic that claimed hackers had used its Claude Code agentic coding system to perpetrate an unprecedented automated cyberattack.
White Waymo vehicle near water, city skyline visible; displays autonomous service for urban freeways.

Self-Driving Cars on U.S. Freeways: Waymo deploys autonomous cars on California and Arizona expressways

Waymo became the first company to offer fully autonomous, driverless taxi service on freeways in the United States.
Icon of silhouettes of kids with a ban symbol, indicating limited chatbot use by teens.

Toward Safer (and Sexier) Chatbots: Inside Character AI and OpenAI’s policy changes to protect younger and vulnerable users

Chatbot providers, facing criticism for engaging troubled users in conversations that deepen their distress, are updating their services to provide wholesome interactions to younger users while allowing adults to pursue erotic conversations.