AI Safety

35 Posts

Chart illustrates exact and approximate memorization percentages in different Gemma models.
AI Safety

Masking Private Data in Training Sets: Google researchers released VaultGemma, an open-weights model redacting personal information

Large language models often memorize details in their training data, including private information that may appear only once, like a person’s name, address, or phone number. Researchers built the first open-weights language model that’s guaranteed not to remember such facts.
Graph showing increasing security risks from 9% to 92% as MCP servers rise from 1 to 10.
AI Safety

MCP Poses Security Risks: Experts identify holes in the popular Model Context Protocol for attackers to access data

The ability to easily connect large language models to tools and data sources has made Model Context Protocol popular among developers, but it also opens security holes, research shows.
AI chatbot interfaces showing tour guide, outdoor adventurer, and custom characters as Meta and OpenAI add safety controls.
AI Safety

Meta, OpenAI Reinforce Guardrails: Meta and OpenAI respond to criticism by adding new rules for teens’ chatbot use

Meta and OpenAI promised to place more controls on their chatbots’ conversations with children and teenagers, as worrisome interactions with minors come under increasing scrutiny.
Charts showing PromptGuard 2 blocking attacks, AlignmentCheck detecting goal hijacking, and CodeShield finding insecure code.
AI Safety

Cybersecurity for Agents: Meta releases LlamaFirewall, an open-source defense against AI hijacking

Autonomous agents built on large language models introduce distinct security concerns. Researchers designed a system to protect agents from common vulnerabilities.
Graph showing frequent chatbot users report lower well-being, based on Character.AI usage and survey analysis.
AI Safety

People With AI Friends Feel Worse: Study shows heavy use of AI companions correlates with lower emotional well-being

People who turn to chatbots for companionship show indications of lower self-reported well-being, researchers found.
Robot hand gripping seal of the U.S. Executive Office of the President, symbolizing government control over national AI policy.
AI Safety

White House Resets U.S. AI Policy: How the White House's Action Plan aims to build AI leadership, infrastructure, and innovation

President Trump set forth principles of an aggressive national AI policy, and he moved to implement them through an action plan and executive orders.
Diagram showing how a language model agent gets misled by malicious posts and sites when searching for Nike shoes online.
AI Safety

Phishing for Agents: Columbia University researchers show how to trick trusting AI agents with poisoned links

Researchers identified a simple way to mislead autonomous agents based on large language models.
Colorful abstract geometric pattern with intersecting green 'X' and diagonal shapes on red, blue, and orange backgrounds, reminiscent of the South African flag
AI Safety

Grok’s Fixation on South Africa: xAI blames unnamed, unauthorized employee for chatbot introducing "white genocide" into conversations

An unauthorized update by an xAI employee caused the Grok chatbot to introduce South African politics into unrelated conversations, the company said.
Man at desk overwhelmed by robot coworkers in office setting with city and tree views.
AI Safety

The User Is Always… a Genius!: OpenAI pulls GPT-4o update after users report sycophantic behavior

OpenAI’s most widely used model briefly developed a habit of flattering users, with laughable and sometimes worrisome results.
Illustration of a businessman in a blue suit sitting alone at the head of a long boardroom table with black chairs.
AI Safety

The Fall and Rise of Sam Altman: Inside Sam Altman’s brief ouster from OpenAI

A behind-the-scenes account provides new details about the abrupt firing and reinstatement of OpenAI CEO Sam Altman in November 2023.
Colorful AI-themed labyrinth game interface with multiple characters and neural icons in a futuristic digital design.
AI Safety

Scraping the Web? Beware the Maze: Cloudflare’s AI Labyrinth traps scrapers with decoy pages

Bots that scrape websites for AI training data often ignore do-not-crawl requests. Now web publishers can enforce such appeals by luring scrapers to AI-generated decoy pages.
Claude 3 Opus performs the Self-Exfiltration task, balancing renewable goals and corporate priorities.
AI Safety

Models Can Use Tools in Deceptive Ways: Researchers expose AI models' deceptive behaviors

Large language models have been shown to be capable of lying when users unintentionally give them an incentive to do so. Further research shows that LLMs with access to tools can be incentivized to use them in deceptive ways.
User entering ZIP code ‘94103’ in U.S. General Election ballot lookup to view contests and candidates.
AI Safety

Voter’s Helper: Perplexity’s AI-powered U.S. election hub assists voters with verified, real-time news and insights

Some voters navigated last week’s United States elections with help from a large language model that generated output based on verified, nonpartisan information.
Cartoon characters in costume contest: ghost wins 1st, mad scientist 2nd, hula girl 3rd.
AI Safety

Innovation Can’t Win: Bureaucracy chokes AI growth as lawmakers tighten grip

Politicians and pundits have conjured visions of doom to convince lawmakers to clamp down on AI. What if terrified legislators choke off innovation in AI?
LLM leaderboard with Chinese models rising in ranks.
AI Safety

A Year of Contending Forces: State of AI report highlights 2024’s major trends and breakthroughs

A new report documents the interplay of powerful forces that drove AI over the past year: open versus proprietary technology, public versus private financing, innovation versus caution. 
Load More

Subscribe to The Batch

Stay updated with weekly AI News and Insights delivered to your inbox