In today’s edition of Data Points, you’ll learn more about:
- Hugging Face’s community approach to benchmarks
- Four cloud companies’ plans to spend $650 billion this year
- A new proof of an old math problem
- Frontier, OpenAI’s enterprise agent system
But first:
Claude Opus 4.6 expands context, improves coding and retrieval
Anthropic released Claude Opus 4.6, featuring a one million token context window in beta, a first for its Opus-class models. The model shows improved coding abilities, including better planning, sustained agentic task performance, larger codebase navigation, and enhanced code review and debugging. It achieved state-of-the-art scores on several evaluations, including the highest score on the Terminal-Bench 2.0 agentic coding benchmark, leading performance on Humanity’s Last Exam, and outperforming GPT-5.2 by 144 Elo points on GDPval-AA, which measures economically valuable knowledge work tasks. On the MRCR v2 needle-in-a-haystack benchmark testing long-context retrieval, Opus 4.6 scored 76 percent compared to Sonnet 4.5’s 18.5 percent, representing what Anthropic calls a qualitative shift in usable context length. Anthropic introduced new API features including adaptive thinking (where the model adjusts reasoning depth based on context), effort controls for balancing intelligence against speed and cost, and compaction for summarizing context in longer tasks. Pricing remains unchanged at $5 per million input tokens and $25 per million output tokens, making it 60 percent cheaper than comparable flagship models while delivering superior performance on knowledge work benchmarks. (Anthropic)
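For readers who want to try it, here is a minimal sketch of a request using Anthropic’s official Python SDK. The model identifier is an assumption based on the naming above, and the new adaptive-thinking, effort, and compaction controls are additional API options whose exact parameter names are not shown here; check Anthropic’s API reference before relying on this.

```python
# Minimal sketch, assuming the official `anthropic` Python SDK.
# "claude-opus-4-6" is an assumed model identifier; the adaptive-thinking,
# effort, and compaction controls mentioned above are separate API options
# not shown in this basic request.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-opus-4-6",  # assumed identifier for Opus 4.6
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Review this function and flag likely bugs: ..."}
    ],
)
print(response.content[0].text)
```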
Codex update is OpenAI’s most capable agentic model to date
OpenAI introduced GPT-5.3-Codex, which combines improved coding performance with stronger reasoning and runs 25 percent faster. The model posted strong results on multiple benchmarks, including 56.8 percent on SWE-Bench Pro and 77.3 percent on Terminal-Bench 2.0. OpenAI says GPT-5.3-Codex helped accelerate its own development, the first time one of its models has contributed substantially to its own creation. Beyond code generation, GPT-5.3-Codex can autonomously build complex games, generate production-ready websites, and handle professional knowledge work tasks across 44 occupations. It is the first OpenAI model trained to identify software vulnerabilities, and it ships with safeguards including safety training and trusted-access restrictions. The model is available now through paid ChatGPT subscriptions. (OpenAI)
Hugging Face decentralizes model benchmarking
Hugging Face introduced Community Evals to address gaps in the current evaluation system, where benchmark scores sometimes don’t correlate with real-world model performance and different sources report varying results. The new system allows users to submit evaluation results for models through pull requests, making the evaluation process transparent and reproducible. Benchmark datasets (like MMLU-Pro and GPQA) now host leaderboards that collect community-submitted scores alongside official ones, with Git history documenting when evaluations were added and modified. All evaluation data is accessible through public APIs, enabling researchers and developers to build custom dashboards and analysis tools. The system provides visibility into the evaluation process, though it does not directly address benchmark saturation or close the gap between test scores and production performance. (Hugging Face)
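Because results are submitted as pull requests against benchmark dataset repos, one way to watch evaluations arrive is to list a repo’s discussions with the huggingface_hub library. This is a rough sketch: the repo ID below is an illustrative choice, not an official Community Evals location, and the listing simply surfaces open pull requests rather than parsed scores.

```python
# Rough sketch: list open pull requests on a benchmark dataset repo, where
# community-submitted evaluation results would arrive as PRs. The repo ID is
# an illustrative assumption, not an official Community Evals endpoint.
from huggingface_hub import HfApi

api = HfApi()
for discussion in api.get_repo_discussions(repo_id="TIGER-Lab/MMLU-Pro", repo_type="dataset"):
    if discussion.is_pull_request and discussion.status == "open":
        print(f"PR #{discussion.num}: {discussion.title}")
```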
Data center spending surges behind AI growth, hardware scarcity
Alphabet, Amazon, Meta, and Microsoft collectively forecast capital expenditures of approximately $650 billion in 2026, representing a 60 percent year-over-year increase. The spending surge is driven primarily by data center construction to support AI model training and inference workloads. This scale of investment reflects intensifying competition among hyperscalers to secure computing capacity for large language models and generative AI applications. The capex commitment signals that infrastructure investment, not just model development, has become the primary capital allocation focus for major tech companies. With spending at this magnitude, data center capacity constraints that have driven recent GPU shortages and colocation demand will likely persist as limiting factors on AI deployment timelines. (Introl)
AI system solves math conjectures that stumped experts for years
Axiom’s AxiomProver solved four previously unsolved mathematical problems, including a five-year-old conjecture in algebraic geometry and Fel’s Conjecture, which involves formulas from mathematician Srinivasa Ramanujan’s century-old notebook. The system combines large language models with proprietary AI trained to reason through problems and to verify solutions in Lean, a formal proof language, enabling it to develop novel, machine-checked proofs rather than merely search the existing literature. AxiomProver independently generated complete proofs for some problems and identified missing connections in others, with all solutions verified and posted to arXiv. Beyond mathematics, Axiom plans to apply the technology to cybersecurity, using formal verification to develop provably reliable and trustworthy code. (arXiv and Wired)
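To give a flavor of what “verified in Lean” means, here is a toy theorem and its machine-checked proof in Lean 4. This is a standard-library fact used purely for illustration, not one of AxiomProver’s results.

```lean
-- Toy example only: a trivial statement whose proof Lean checks mechanically.
-- AxiomProver's results are far deeper; this just shows the kind of artifact
-- that formal verification produces.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```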
OpenAI wants to help big businesses wrangle their agents
OpenAI released Frontier, a platform designed to help enterprises build, deploy, and manage AI agents across business functions. The platform functions as a centralized interface that creates shared context for agents, allowing them to operate across different environments while setting clear permission boundaries and security controls suitable for regulated industries. Frontier connects fragmented tools and siloed data, enabling human teams to “hire” AI agents for tasks like code execution and data analysis while building persistent memories that improve agent usefulness over time. The platform is currently available to a limited set of early customers including Intuit, State Farm, Thermo Fisher, and Uber, with broader availability planned over the coming months. OpenAI declined to disclose pricing at launch. The move positions Frontier as a direct competitor to Microsoft’s Agent 365 and Anthropic’s Claude Cowork, reflecting industry-wide efforts to monetize AI agents as core revenue-generating products for enterprise customers. (OpenAI)
Want to know more about what matters in AI right now?
Read the latest issue of The Batch for in-depth analysis of news and research.
Last week, Andrew Ng discussed how AI is changing the job market, noting that while AI-related job losses have so far been minimal, demand for AI skills is reshaping employment opportunities and making workers who adapt to AI more valuable.
“Instead, a common refrain applies: AI won’t replace workers, but workers who use AI will replace workers who don’t. For instance, because AI coding tools make developers much more efficient, developers who know how to use them are increasingly in-demand.”
Read Andrew’s letter here.
Other top AI news and research stories covered in depth:
- Agents Unleashed dives into the reality behind the hype surrounding OpenClaw and Moltbook, offering perspective on their capabilities and weaknesses.
- Kimi K2.5 Creates Its Own Workforce explains how Moonshot AI took the open-model crown, leveraging subagents for enhanced performance.
- AI Giants Share Wikipedia’s Costs reported that the Wikimedia Foundation secured partnerships with Amazon, Meta, Microsoft, Mistral AI, and Perplexity, among others, exchanging financial support for enhanced data access.
- Mistral employed cascade distillation on Mistral 3 to build the Ministral family, showing a novel approach to creating smaller yet capable AI models.
A special offer for our community
DeepLearning.AI recently launched the first-ever subscription plan for our entire course catalog! As a Pro Member, you’ll immediately enjoy access to:
- Over 150 AI courses and specializations from Andrew Ng and industry experts
- Labs and quizzes to test your knowledge
- Projects to share with employers
- Certificates to testify to your new skills
- A community to help you advance at the speed of AI
Enroll now to lock in a year of full access for $25 per month paid upfront, or opt for month-to-month payments at just $30 per month. Both payment options begin with a one-week free trial. Explore Pro’s benefits and start building today!