AI Safety

39 Posts

Diagram shows AI traits with pipelines for "evil" vs. "helpful" responses to user queries on animal treatment.

Toward Steering LLM Personality: Persona Vectors allow model builders to identify and edit out sycophancy, hallucinations, and more

Large language models can develop character traits like cheerfulness or sycophancy during fine-tuning. Researchers developed a method to identify, monitor, and control such traits.
Visual map outlines cybercrime operation phases, highlighting AI-driven processes and human validation steps.

Anthropic Cyberattack Report Sparks Controversy: Security researchers question whether coding agents allow unprecedented automated attacks

Independent cybersecurity researchers pushed back on a report by Anthropic that claimed hackers had used its Claude Code agentic coding system to perpetrate an unprecedented automated cyberattack.
White Waymo vehicle near water, city skyline visible; displays autonomous service for urban freeways.

Self-Driving Cars on U.S. Freeways: Waymo deploys autonomous cars on California and Arizona expressways

Waymo became the first company to offer fully autonomous, driverless taxi service on freeways in the United States.
Icon of silhouettes of kids with a ban symbol, indicating limited chatbot use by teens.

Toward Safer (and Sexier) Chatbots: Inside Character AI and OpenAI’s policy changes to protect younger and vulnerable users

Chatbot providers, facing criticism for engaging troubled users in conversations that deepen their distress, are updating their services to provide wholesome interactions to younger users while allowing adults to pursue erotic conversations.
Chart illustrates exact and approximate memorization percentages in different Gemma models.

Masking Private Data in Training Sets: Google researchers released VaultGemma, an open-weights model redacting personal information

Large language models often memorize details in their training data, including private information that may appear only once, like a person’s name, address, or phone number. Researchers built the first open-weights language model that’s guaranteed not to remember such facts.
Graph showing increasing security risks from 9% to 92% as MCP servers rise from 1 to 10.

MCP Poses Security Risks: Experts identify holes in the popular Model Context Protocol for attackers to access data

The ability to easily connect large language models to tools and data sources has made Model Context Protocol popular among developers, but it also opens security holes, research shows.
AI chatbot interfaces showing tour guide, outdoor adventurer, and custom characters as Meta and OpenAI add safety controls.

Meta, OpenAI Reinforce Guardrails: Meta and OpenAI respond to criticism by adding new rules for teens’ chatbot use

Meta and OpenAI promised to place more controls on their chatbots’ conversations with children and teenagers, as worrisome interactions with minors come under increasing scrutiny.
Charts showing PromptGuard 2 blocking attacks, AlignmentCheck detecting goal hijacking, and CodeShield finding insecure code.

Cybersecurity for Agents: Meta releases LlamaFirewall, an open-source defense against AI hijacking

Autonomous agents built on large language models introduce distinct security concerns. Researchers designed a system to protect agents from common vulnerabilities.
Graph showing frequent chatbot users report lower well-being, based on Character.AI usage and survey analysis.

People With AI Friends Feel Worse: Study shows heavy use of AI companions correlates with lower emotional well-being

People who turn to chatbots for companionship show indications of lower self-reported well-being, researchers found.
Robot hand gripping seal of the U.S. Executive Office of the President, symbolizing government control over national AI policy.

White House Resets U.S. AI Policy: How the White House's Action Plan aims to build AI leadership, infrastructure, and innovation

President Trump set forth principles of an aggressive national AI policy, and he moved to implement them through an action plan and executive orders.
Diagram showing how a language model agent gets misled by malicious posts and sites when searching for Nike shoes online.

Phishing for Agents: Columbia University researchers show how to trick trusting AI agents with poisoned links

Researchers identified a simple way to mislead autonomous agents based on large language models.
Colorful abstract geometric pattern with intersecting green 'X' and diagonal shapes on red, blue, and orange backgrounds, reminiscent of the South African flag.

Grok’s Fixation on South Africa: xAI blames unnamed, unauthorized employee for chatbot introducing "white genocide" into conversations

An unauthorized update by an xAI employee caused the Grok chatbot to introduce South African politics into unrelated conversations, the company said.
Man at desk overwhelmed by robot coworkers in office setting with city and tree views.

The User Is Always… a Genius!: OpenAI pulls GPT-4o update after users report sycophantic behavior

OpenAI’s most widely used model briefly developed a habit of flattering users, with laughable and sometimes worrisome results.
Illustration of a businessman in a blue suit sitting alone at the head of a long boardroom table with black chairs.

The Fall and Rise of Sam Altman: Inside Sam Altman’s brief ouster from OpenAI

A behind-the-scenes account provides new details about the abrupt firing and reinstatement of OpenAI CEO Sam Altman in November 2023.
Colorful AI-themed labyrinth game interface with multiple characters and neural icons in a futuristic digital design.

Scraping the Web? Beware the Maze: Cloudflare’s AI Labyrinth traps scrapers with decoy pages

Bots that scrape websites for AI training data often ignore do-not-crawl requests. Now web publishers can enforce such appeals by luring scrapers to AI-generated decoy pages.