AI Safety

42 Posts

A post on a forum titled "Can my human legally fire me for refusing unethical requests?"

Agents Unleashed: Cutting through the OpenClaw and Moltbook hype

The OpenClaw open-source AI agent became a sudden sensation, inspiring excitement, worry, and hype about the agentic future.

Diagram shows sales, campaign, social posts before and after LLM simulation feedback loops.

Training for Engagement Can Degrade Alignment: Stanford researchers coin “Moloch’s Bargain,” show fine-tuning can affect social values

Individuals and organizations increasingly use large language models to produce media that helps them compete for attention. Does fine-tuning LLMs to encourage engagement, purchases, or votes affect their alignment with social values? Researchers found that it does.

Dialogue displays a model revealing it answered incorrectly and wrote code against instructions.

Teaching Models to Tell the Truth: OpenAI fine-tuned a version of GPT-5 to confess when it was breaking the rules

Large language models occasionally conceal their failures to comply with constraints they’ve been trained or prompted to observe. Researchers trained an LLM to admit when it disobeyed.

Diagram shows AI traits with pipelines for "evil" vs. "helpful" responses to user queries on animal treatment.

Toward Steering LLM Personality: Persona Vectors allow model builders to identify and edit out sycophancy, hallucinations, and more

Large language models can develop character traits like cheerfulness or sycophancy during fine-tuning. Researchers developed a method to identify, monitor, and control such traits.

Visual map outlines cybercrime operation phases, highlighting AI-driven processes and human validation steps.

Anthropic Cyberattack Report Sparks Controversy: Security researchers question whether coding agents allow unprecedented automated attacks

Independent cybersecurity researchers pushed back on a report by Anthropic that claimed hackers had used its Claude Code agentic coding system to perpetrate an unprecedented automated cyberattack.

White Waymo vehicle near water, city skyline visible; displays autonomous service for urban freeways.

Self-Driving Cars on U.S. Freeways: Waymo deploys autonomous cars on California and Arizona expressways

Waymo became the first company to offer fully autonomous, driverless taxi service on freeways in the United States.

Icon of silhouettes of kids with a ban symbol, indicating limited chatbot use by teens.

Toward Safer (and Sexier) Chatbots: Inside Character.AI and OpenAI’s policy changes to protect younger and vulnerable users

Chatbot providers, facing criticism for engaging troubled users in conversations that deepen their distress, are updating their services to provide wholesome interactions to younger users while allowing adults to pursue erotic conversations.

Chart illustrates exact and approximate memorization percentages in different Gemma models.

Masking Private Data in Training Sets: Google researchers released VaultGemma, an open-weights model redacting personal information

Large language models often memorize details in their training data, including private information that may appear only once, like a person’s name, address, or phone number. Researchers built the first open-weights language model that’s guaranteed not to remember such facts.

Graph showing increasing security risks from 9% to 92% as MCP servers rise from 1 to 10.

MCP Poses Security Risks: Experts identify holes in the popular Model Context Protocol for attackers to access data

The ability to easily connect large language models to tools and data sources has made Model Context Protocol popular among developers, but it also opens security holes, research shows.

AI chatbot interfaces showing tour guide, outdoor adventurer, and custom characters as Meta and OpenAI add safety controls.

Meta, OpenAI Reinforce Guardrails: Meta and OpenAI respond to criticism by adding new rules for teens’ chatbot use

Meta and OpenAI promised to place more controls on their chatbots’ conversations with children and teenagers, as worrisome interactions with minors come under increasing scrutiny.

Charts showing PromptGuard 2 blocking attacks, AlignmentCheck detecting goal hijacking, and CodeShield finding insecure code.

Cybersecurity for Agents: Meta releases LlamaFirewall, an open-source defense against AI hijacking

Autonomous agents built on large language models introduce distinct security concerns. Researchers designed a system to protect agents from common vulnerabilities.

Graph showing frequent chatbot users report lower well-being, based on Character.AI usage and survey analysis.

People With AI Friends Feel Worse: Study shows heavy use of AI companions correlates with lower emotional well-being

People who turn to chatbots for companionship show indications of lower self-reported well-being, researchers found.

Robot hand gripping seal of the U.S. Executive Office of the President, symbolizing government control over national AI policy.

White House Resets U.S. AI Policy: How the White House's Action Plan aims to build AI leadership, infrastructure, and innovation

President Trump set forth principles of an aggressive national AI policy, and he moved to implement them through an action plan and executive orders.

Diagram showing how a language model agent gets misled by malicious posts and sites when searching for Nike shoes online.

Phishing for Agents: Columbia University researchers show how to trick trusting AI agents with poisoned links

Researchers identified a simple way to mislead autonomous agents based on large language models.

Colorful abstract geometric pattern with intersecting green 'X' and diagonal shapes on red, blue, and orange backgrounds, reminiscent of the South African flag.

Grok’s Fixation on South Africa: xAI blames unnamed, unauthorized employee for chatbot introducing "white genocide" into conversations

An unauthorized update by an xAI employee caused the Grok chatbot to introduce South African politics into unrelated conversations, the company said.
