Claude Opus 4.6 pushes the envelope, GPT-5.3-Codex shines on agentic coding and game building

Published: Feb 9, 2026
Reading time: 5 min
(Image: Two retro robots playing ping pong in a futuristic room; the scoreboard reads Claude: 15, Codex: 14.)

In today’s edition of Data Points, you’ll learn more about:

  • Hugging Face’s community approach to benchmarks
  • Four cloud companies’ plans to spend $650 billion this year
  • A new proof of an old math problem
  • Frontier, OpenAI’s enterprise agent system

But first:

Claude Opus 4.6 expands context, improves coding and retrieval

Anthropic released Claude Opus 4.6, featuring a one million token context window in beta, a first for its Opus-class models. The model shows improved coding abilities, including better planning, sustained agentic task performance, larger codebase navigation, and enhanced code review and debugging. It achieved state-of-the-art scores on several evaluations, including the highest score on the Terminal-Bench 2.0 agentic coding benchmark, leading performance on Humanity’s Last Exam, and outperforming GPT-5.2 by 144 Elo points on GDPval-AA, which measures economically valuable knowledge work tasks. On the MRCR v2 needle-in-a-haystack benchmark testing long-context retrieval, Opus 4.6 scored 76 percent compared to Sonnet 4.5’s 18.5 percent, representing what Anthropic calls a qualitative shift in usable context length. Anthropic also introduced new API features: adaptive thinking (where the model adjusts reasoning depth based on context), effort controls for balancing intelligence against speed and cost, and compaction for summarizing context in longer tasks. Pricing remains unchanged at $5 per million input tokens and $25 per million output tokens, making it 60 percent cheaper than comparable flagship models while delivering superior performance on knowledge work benchmarks. (Anthropic)
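At those rates, per-request cost is easy to estimate. A minimal sketch in Python, using only the prices quoted above; the function name and example token counts are illustrative, not part of any API:

```python
# Estimate the cost of one request at the article's published rates:
# $5 per million input tokens, $25 per million output tokens.
INPUT_PRICE_PER_MTOK = 5.00
OUTPUT_PRICE_PER_MTOK = 25.00

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the dollar cost of a single request at the quoted rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_MTOK + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_MTOK

# Filling the full one-million-token context window with input
# and generating 10,000 output tokens costs about $5.25:
print(round(request_cost(1_000_000, 10_000), 2))
```

Note how the asymmetric pricing means output tokens dominate cost only for generation-heavy workloads; for long-context retrieval over a full million-token window, the input side is the bigger line item.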

Codex update is OpenAI’s most capable agentic model to date

OpenAI introduced GPT-5.3-Codex, combining improved coding performance with stronger reasoning capabilities and running 25 percent faster. The model posted strong benchmark results, including 56.8 percent on SWE-Bench Pro and 77.3 percent on Terminal-Bench 2.0. GPT-5.3-Codex was also instrumental in accelerating its own development, the first time an OpenAI model has substantially contributed to creating itself. Beyond code generation, GPT-5.3-Codex can autonomously build complex games, generate production-ready websites, and handle professional knowledge work tasks across 44 occupations. It is the first OpenAI model trained to identify software vulnerabilities, with comprehensive safeguards including safety training and trusted-access restrictions. The model is available now through paid ChatGPT subscriptions. (OpenAI)

Hugging Face decentralizes model benchmarking

Hugging Face introduced Community Evals to address gaps in the current evaluation system, where benchmark scores sometimes don’t correlate with real-world model performance and different sources report varying results. The new system allows users to submit evaluation results for models through pull requests, making the evaluation process transparent and reproducible. Benchmark datasets (like MMLU-Pro and GPQA) now host leaderboards that collect community-submitted scores alongside official ones, with Git history documenting when evaluations were added and modified. All evaluation data is accessible through public APIs, enabling researchers and developers to build custom dashboards and analysis tools. The system provides visibility into the evaluation process, though it does not directly address benchmark saturation or close the gap between test scores and production performance. (Hugging Face)
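Because every submitted evaluation lives in the open as data, anyone can aggregate it their own way. A minimal sketch of how client-side aggregation might look; the entry schema, model name, and scores below are invented for illustration and are not the actual Community Evals format:

```python
from statistics import median

# Hypothetical community-submitted scores for one model on one benchmark.
# In Community Evals each entry would arrive as a pull request against the
# benchmark dataset's repo; here they are plain dicts for illustration.
submissions = [
    {"model": "example-model", "benchmark": "MMLU-Pro", "score": 71.2, "source": "community"},
    {"model": "example-model", "benchmark": "MMLU-Pro", "score": 70.8, "source": "community"},
    {"model": "example-model", "benchmark": "MMLU-Pro", "score": 74.0, "source": "official"},
]

def summarize(entries):
    """Aggregate scores per (model, benchmark) pair, keeping official and
    community-reported numbers side by side rather than merging them."""
    grouped = {}
    for e in entries:
        key = (e["model"], e["benchmark"])
        grouped.setdefault(key, {"official": [], "community": []})
        grouped[key][e["source"]].append(e["score"])
    return {
        key: {
            "official": vals["official"],
            "community_median": median(vals["community"]) if vals["community"] else None,
            "n_community": len(vals["community"]),
        }
        for key, vals in grouped.items()
    }

print(summarize(submissions))
```

Keeping official and community scores separate, rather than averaging them together, is the design choice that lets readers see exactly where a disputed benchmark number came from.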

Data center spending surges behind AI growth, hardware scarcity

Alphabet, Amazon, Meta, and Microsoft collectively forecast capital expenditures of approximately $650 billion in 2026, representing a 60 percent year-over-year increase. The spending surge is driven primarily by data center construction to support AI model training and inference workloads. This scale of investment reflects intensifying competition among hyperscalers to secure computing capacity for large language models and generative AI applications. The capex commitment signals that infrastructure investment, not just model development, has become the primary capital allocation focus for major tech companies. With spending at this magnitude, data center capacity constraints that have driven recent GPU shortages and colocation demand will likely persist as limiting factors on AI deployment timelines. (Introl)

AI system solves math conjectures that stumped experts for years

Axiom’s AxiomProver solved four previously unsolved mathematical problems, including a five-year-old conjecture in algebraic geometry and Fel’s Conjecture involving formulas from mathematician Srinivasa Ramanujan’s century-old notebook. The system combines large language models with proprietary AI trained to reason through problems and verify solutions using Lean, a specialized mathematical language that enables it to develop novel proofs rather than search existing literature. AxiomProver independently generated complete proofs for some problems and identified missing connections in others, with all solutions verified and posted to arXiv. Beyond mathematics, Axiom plans to apply the technology to cybersecurity, using formal verification to develop provably reliable and trustworthy code. (arXiv and Wired)
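For readers unfamiliar with Lean, the point of such a verifier is that a proof is a program the compiler checks: a file compiles only if every proof in it is valid. A toy Lean 4 example, unrelated to the conjectures above, shows what a machine-checked statement looks like:

```lean
-- A trivial identity the Lean kernel verifies by computation.
theorem two_plus_two : 2 + 2 = 4 := rfl

-- Commutativity of addition on the naturals, discharged with a
-- standard library lemma; Lean rejects the file if the proof term
-- does not actually establish the stated theorem.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

This is why Lean-verified proofs sidestep the usual worry about language models hallucinating plausible-looking arguments: a wrong step simply fails to compile.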

OpenAI wants to help big businesses wrangle their agents

OpenAI released Frontier, a platform designed to help enterprises build, deploy, and manage AI agents across business functions. The platform functions as a centralized interface that creates shared context for agents, allowing them to operate across different environments while setting clear permission boundaries and security controls suitable for regulated industries. Frontier connects fragmented tools and siloed data, enabling human teams to “hire” AI agents for tasks like code execution and data analysis while building persistent memories that improve agent usefulness over time. The platform is currently available to a limited set of early customers including Intuit, State Farm, Thermo Fisher, and Uber, with broader availability planned over the coming months. OpenAI declined to disclose pricing at launch. The move positions Frontier as a direct competitor to Microsoft’s Agent 365 and Anthropic’s Claude Cowork, reflecting industry-wide efforts to monetize AI agents as core revenue-generating products for enterprise customers. (OpenAI)


Want to know more about what matters in AI right now?

Read the latest issue of The Batch for in-depth analysis of news and research.

Last week, Andrew Ng discussed how AI is changing the job market, noting that while AI-related job losses have so far been minimal, demand for AI skills is reshaping employment opportunities and making workers who adapt to AI more valuable.

“Instead, a common refrain applies: AI won’t replace workers, but workers who use AI will replace workers who don’t. For instance, because AI coding tools make developers much more efficient, developers who know how to use them are increasingly in-demand.”

Read Andrew’s letter here.


A special offer for our community

DeepLearning.AI recently launched the first-ever subscription plan for our entire course catalog! As a Pro Member, you’ll immediately enjoy access to:

  • Over 150 AI courses and specializations from Andrew Ng and industry experts
  • Labs and quizzes to test your knowledge
  • Projects to share with employers
  • Certificates to testify to your new skills
  • A community to help you advance at the speed of AI

Enroll now to lock in a year of full access for $25 per month paid upfront, or opt for month-to-month payments at $30 per month. Both options begin with a one-week free trial. Explore Pro’s benefits and start building today!

Try Pro Membership


Subscribe to Data Points

Your accelerated guide to AI news and research