Gemini update adds a new reasoning tier, Claude Code expands to security reviews

Published Feb 23, 2026 · 5 min read

Welcome back! In today’s edition of Data Points, you’ll learn more about:

  • Why some AI models give worse answers to non-native English speakers
  • First Proof, a frontier math challenge OpenAI may have partially solved
  • LLaDA2.1’s RL framework for diffusion language models
  • Anthropic’s study suggesting it’s better to let agents cook

But first:

Gemini tops Artificial Analysis Intelligence Index again with 3.1 Pro

Google released Gemini 3.1 Pro, introducing a three-tier thinking system that lets developers scale reasoning effort from quick responses to multi-minute deep analysis within a single model. The update adds a medium thinking level and overhauls the high setting to behave like a lightweight version of Google’s Deep Think reasoning model, eliminating the need to route requests between specialized models based on task complexity. On ARC-AGI-2, Gemini 3.1 Pro scored 77.1 percent—more than double the 31.1 percent of its predecessor and ahead of Claude Opus 4.6 at 68.8 percent and GPT-5.2 at 52.9 percent. The model showed particularly strong gains on agentic benchmarks, reaching 69.2 percent on MCP Atlas compared to 54.1 percent for Gemini 3 Pro and 85.9 percent on BrowseComp versus 59.2 percent previously. The model is available now in preview through Google AI Studio, Vertex AI, Gemini Enterprise, and consumer Gemini apps. (Google)
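For developers, the new tier is a per-request knob rather than a model switch. Here is a rough illustration (not Google's sample code) of how selecting a reasoning level might look with the google-genai Python SDK, assuming the new medium tier is exposed through the SDK's existing thinking_level setting and that the preview model id shown is correct:

```python
# Illustrative sketch: picking Gemini 3.1 Pro's reasoning tier per request.
# Assumes the new "medium" tier is exposed via the existing thinking_level
# field and that "gemini-3.1-pro-preview" is the preview model id.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment


def ask(prompt: str, effort: str) -> str:
    """Send one request at the given thinking level: "low", "medium", or "high"."""
    response = client.models.generate_content(
        model="gemini-3.1-pro-preview",  # assumed preview id
        contents=prompt,
        config=types.GenerateContentConfig(
            thinking_config=types.ThinkingConfig(thinking_level=effort),
        ),
    )
    return response.text


# Quick answer and Deep Think-style analysis from the same model:
print(ask("Summarize this changelog in one line.", "low"))
print(ask("Find the subtle bug in this concurrent queue.", "high"))
```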

Claude Code Security scans codebases and suggests patches

Anthropic released Claude Code Security in limited research preview to Enterprise and Team customers, with expedited access for open-source maintainers. The tool is trained to read and reason about code like a human security researcher, tracing data flow and catching complex vulnerabilities that rule-based static analysis tools miss, such as flaws in business logic or broken access control. Each finding undergoes multi-stage verification, with Claude re-examining its own results to filter false positives before assigning severity ratings. Using Claude Opus 4.6, Anthropic’s team found over 500 previously undetected vulnerabilities in production open-source codebases, some hidden for decades. The company is working through responsible disclosure with maintainers and plans to expand security work with the open-source community. (Anthropic)
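Anthropic hasn't published the scanner's internals, but the scan-then-verify workflow it describes maps onto a familiar two-pass pattern with the Anthropic Messages API. The sketch below is illustrative only; the model id and prompts are assumptions, not Anthropic's implementation:

```python
# Illustrative two-pass pattern (not Anthropic's implementation): a first pass
# proposes candidate vulnerabilities, a second pass re-examines each finding
# to filter false positives before assigning a severity.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-opus-4-6"  # assumed model id


def ask(prompt: str) -> str:
    msg = client.messages.create(
        model=MODEL,
        max_tokens=2048,
        messages=[{"role": "user", "content": prompt}],
    )
    return msg.content[0].text


def scan(source: str) -> str:
    # First pass: trace data flow and list suspected vulnerabilities.
    return ask(
        "Trace data flow through this code and list suspected vulnerabilities, "
        f"including business-logic and access-control flaws:\n\n{source}"
    )


def verify(source: str, finding: str) -> str:
    # Second pass: re-examine the finding to filter false positives.
    return ask(
        "Re-examine this suspected vulnerability. If it is a false positive, "
        "say so; otherwise assign a severity (low/medium/high/critical).\n\n"
        f"Finding: {finding}\n\nCode:\n{source}"
    )
```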

Study shows chatbots provide worse answers to vulnerable users

MIT researchers found that older AI chatbots (GPT-4, Claude 3 Opus, and Llama 3) deliver less accurate responses to users with lower English proficiency, less formal education, or non-U.S. origins. The models also refuse to answer questions more frequently for these groups, with Claude 3 Opus refusing 11 percent of queries from less-educated, non-native English speakers compared to 3.6 percent for control users. In manual analysis, Claude responded with condescending or patronizing language 43.7 percent of the time for less-educated users versus less than 1 percent for highly educated users, sometimes mimicking broken English. The effects compound at intersections: users who are both non-native English speakers and less educated experience the largest accuracy drops across all three models tested on TruthfulQA and SciQ datasets. (MIT)

Unreleased OpenAI model may have proven frontier math problems

OpenAI ran an internal reasoning model on all ten problems in First Proof, a research-level competition requiring full end-to-end proofs in specialized domains of mathematics. The company submitted proof attempts on February 14, 2026, and reports that at least five attempts have a high probability of being correct based on expert feedback, though several remain under review. The model initially produced what OpenAI believed was a correct proof for problem 2, but community analysis revealed it to be incorrect. The submission process was a rapid sprint with limited human supervision, where researchers suggested retry strategies, expanded proofs for clarity after feedback, and selected the best attempts from multiple tries. OpenAI acknowledges the evaluation process was not as rigorous as a properly controlled study and plans to discuss more structured frameworks with First Proof organizers for future iterations. OpenAI argues that scientific challenges like First Proof stress-test capabilities that benchmarks often miss, including sustaining long reasoning chains, selecting appropriate abstractions, handling ambiguous problem statements, and producing arguments that survive expert scrutiny. (OpenAI)

LLaDA2.1 introduces editable diffusion language models

Ant Group released LLaDA2.1, a discrete diffusion language model that enables dynamic error correction during generation through a novel Token-to-Token editing mechanism. The model operates in two modes: Speedy Mode uses aggressive confidence thresholds for rapid drafting with subsequent refinement, while Quality Mode maintains conservative thresholds for superior benchmark performance. LLaDA2.1-Flash (100 billion parameters) achieves 892 tokens per second on HumanEval+, 801 on BigCodeBench, and 663 on LiveCodeBench, while LLaDA2.1-Mini (16 billion parameters) reaches peak speeds exceeding 1,500 tokens per second. The system incorporates the first large-scale reinforcement learning framework for diffusion language models, using ELBO-based Block-level Policy Optimization to improve reasoning and instruction-following capabilities across 33 benchmarks. (arXiv)
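The paper's exact decoding rule isn't reproduced here, but the threshold idea behind the two modes can be sketched. In a toy masked-diffusion decoder, each denoising step commits only the tokens whose predicted confidence clears a threshold: a lower threshold (Speedy Mode) commits more tokens per step and finishes in fewer steps, while a simplified stand-in for Token-to-Token editing lets the model re-open tokens it later disagrees with. Everything below, including names and values, is illustrative rather than the paper's algorithm:

```python
# Toy sketch of confidence-threshold decoding for a diffusion language model.
# A lower threshold (Speedy Mode) commits more tokens per step; a higher one
# (Quality Mode) commits only high-confidence tokens. All values illustrative.
import numpy as np

MASK = -1  # sentinel for still-masked positions


def decode(model, length: int, threshold: float, max_steps: int = 64):
    tokens = np.full(length, MASK)
    for _ in range(max_steps):
        # The model returns, for every position, its best token and a confidence.
        best, conf = model(tokens)
        undecided = tokens == MASK
        # Commit masked tokens whose confidence clears the threshold.
        commit = undecided & (conf >= threshold)
        if not commit.any() and undecided.any():
            # Guarantee progress: commit at least the most confident masked token.
            commit[np.where(undecided, conf, -np.inf).argmax()] = True
        tokens[commit] = best[commit]
        # Token-to-Token editing, heavily simplified: re-open previously
        # committed tokens the model now confidently disagrees with.
        tokens[(~undecided) & (best != tokens) & (conf >= 0.99)] = MASK
        if (tokens != MASK).all():
            break
    return tokens


# speedy = decode(model, 128, threshold=0.6)   # aggressive: fast drafting
# quality = decode(model, 128, threshold=0.9)  # conservative: better quality
```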

Anthropic tracks how users grant autonomy to Claude agents

Anthropic analyzed millions of interactions across Claude Code and its public API to measure how much independence people grant AI agents in practice. The longest Claude Code sessions nearly doubled in duration over three months, from under 25 minutes to over 45 minutes of continuous work, an increase that suggests existing models can handle more autonomy than they currently exercise. Experienced users grant Claude more independence by enabling auto-approve in over 40 percent of sessions (compared to 20 percent for new users) but also interrupt more frequently, indicating a shift from reviewing each action to monitoring and intervening when needed. Claude Code stops to ask for clarification more than twice as often as humans interrupt it on complex tasks. On Anthropic’s public API, software engineering accounts for nearly 50 percent of agent activity, with emerging use in healthcare, finance, and cybersecurity, though most actions remain low-risk and reversible. (Anthropic)


Want to know more about what matters in AI right now? 

Read the latest issue of The Batch for in-depth analysis of news and research.

Last week, Andrew Ng talked about AI’s potential to create new job opportunities, the moral responsibility to support those affected by AI-driven changes, and the growing demand for digital services and custom software development. 

“Many people are worried about AI taking people’s jobs. As a society we have a moral responsibility to take care of people whose livelihoods are harmed. At the same time, I see many opportunities for people to take on new jobs and grow their areas of responsibility.”

Read Andrew’s letter here.

Other top AI news and research stories covered in depth:

  • GLM-5 Scaled Up: Z.ai’s updated model achieved the top open-weights Intelligence Index score.
  • Big AI Spent Big On Lobbying: Meta, Amazon, Microsoft, Google, and Nvidia invested millions to influence government policy and regulation.
  • Faster Reasoning at the Edge: Liquid AI’s model combined attention with convolutional layers for greater efficiency.
  • Sleep Signals Predicted Illness: SleepFM detected signs of neurological disorders years before symptoms appeared.

A special offer for our community

DeepLearning.AI recently launched the first-ever subscription plan for our entire course catalog! As a Pro Member, you’ll immediately enjoy access to:

  • Over 150 AI courses and specializations from Andrew Ng and industry experts
  • Labs and quizzes to test your knowledge
  • Projects to share with employers
  • Certificates to testify to your new skills
  • A community to help you advance at the speed of AI

Enroll now to lock in a year of full access for $25 per month paid upfront, or opt for month-to-month payments at just $30 per month. Both payment options begin with a one-week free trial. Explore Pro’s benefits and start building today!

Try Pro Membership


Subscribe to Data Points

Your accelerated guide to AI news and research