Twice a week, Data Points brings you the latest AI news, tools, models, and research in brief. In today’s edition, you’ll find:
- Nvidia gives Project DIGITS a new name
- AI models compete to build Minecraft items
- Claude chatbot now includes search
- A Moore’s law-like regularity for AI agents
But first:
AI model surpasses traditional weather forecasting systems
Researchers at Cambridge University, Microsoft, Google, and other institutions developed Aardvark Weather, an end-to-end machine learning model that outperforms traditional numerical weather prediction systems for global and local forecasts. The model ingests raw observational data and produces accurate forecasts up to ten days in advance, competing with state-of-the-art systems that incorporate human input. The model’s high accuracy shows the potential for fully data-driven weather prediction to significantly reduce computational costs and enable customized forecasting models for individual users or smaller nations. (Nature)
OpenAI releases new speech models
OpenAI debuted new speech-to-text models (gpt-4o-transcribe and gpt-4o-mini-transcribe) and a text-to-speech model (gpt-4o-mini-tts) that outperform current Whisper models on tests of accuracy and reliability. The speech-to-text models achieve lower Word Error Rates across multiple benchmarks, while the text-to-speech model lets developers instruct it to speak in specific ways (like a storytelling pirate or a calm customer service representative). OpenAI says it built the new models using new distillation techniques, which shrink large models like GPT-4o, and reinforcement learning, which improves transcription and voice generation accuracy. (OpenAI)
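For readers unfamiliar with Word Error Rate, here is a minimal sketch of how the metric is commonly computed: the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and a model's output, divided by the number of words in the reference. The function name and the whitespace tokenization are assumptions for illustration; this is not OpenAI's evaluation code, and real benchmarks usually normalize casing and punctuation first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: splits on whitespace and applies no text
    normalization (casing, punctuation), which real benchmarks usually do.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Lower is better: a score of 0 means the transcript matches the reference word for word, and the score can exceed 1 when the output contains many inserted words.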
Nvidia unveils personal computers for AI developers
Nvidia CEO Jensen Huang introduced two new AI-focused desktop systems, DGX Spark (formerly known as Project DIGITS) and DGX Station (a larger model), during the company’s GTC keynote. The computers, powered by Nvidia’s Grace Blackwell platform, are designed to enable developers, researchers, and data scientists to run large AI models locally for prototyping and fine-tuning. Five major PC manufacturers, including Asus, Dell, HP, and Lenovo, will produce these systems, with DGX Spark reservations opening immediately and DGX Station expected later in 2025. (Ars Technica)
Minecraft emerges as novel AI benchmark tool
Developers led by 12th-grader Adi Singh created Minecraft Benchmark (MC-Bench), a website where AI models compete to build Minecraft creations by writing code based on prompts. Users vote on the best builds without knowing which AI produced them, providing a novel way to assess AI capabilities beyond traditional benchmarks. The site is built with subsidies from Anthropic, OpenAI, and Alibaba, but remains unaffiliated; currently Claude 3.7 Sonnet tops the leaderboard. MC-Bench’s approach tests coding ability, visual understanding, and problem solving in a way that leverages Minecraft’s widespread familiarity to make AI progress more accessible and understandable to the general public. (MC-Bench and TechCrunch)
Claude introduces web search
Anthropic added web search functionality to its AI chatbot Claude, allowing it to access up-to-date information and provide more relevant responses to queries. The feature is currently available in preview for paid U.S. users of Claude 3.7 Sonnet, with plans to expand to free users and other countries. This update enables Claude to incorporate current data from internet sources, providing inline citations in conversational, aggregated responses, similar to competitors like ChatGPT and Gemini. (Anthropic)
Identifying a new AI problem-solving progression law
Researchers at METR proposed a “50%-task-completion time horizon” metric to compare AI and human capabilities on various long-duration tasks. Current top AI models like Claude 3.7 Sonnet can complete, with 50 percent success, tasks that take skilled humans about 50 minutes. This time horizon has doubled roughly every seven months since 2019. In other words, if the trend holds, in seven months we may expect a model to complete with 50 percent success a task that takes humans 100 minutes, then 200 minutes seven months after that, and so on. This metric offers AI developers a concrete way to measure progress in AI capabilities relative to human performance, potentially signaling that within five years, top AI agents may be able to automate tasks with 50 percent success that currently take skilled humans about a month to complete. (arXiv)
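The doubling arithmetic can be sketched in a few lines. This is a rough extrapolation under stated assumptions, not METR's methodology: it takes the 50-minute starting horizon and seven-month doubling period from the summary above and assumes the trend continues unchanged. The function names and the 167-hour working month are assumptions introduced for illustration.

```python
import math

CURRENT_HORIZON_MINUTES = 50.0  # from the summary above
DOUBLING_PERIOD_MONTHS = 7.0    # from the summary above


def projected_horizon_minutes(months_ahead: float) -> float:
    """Time horizon after `months_ahead` months of steady doubling."""
    doublings = months_ahead / DOUBLING_PERIOD_MONTHS
    return CURRENT_HORIZON_MINUTES * 2 ** doublings


def months_until_horizon(target_minutes: float) -> float:
    """Months of steady doubling needed to reach `target_minutes`."""
    doublings = math.log2(target_minutes / CURRENT_HORIZON_MINUTES)
    return DOUBLING_PERIOD_MONTHS * doublings


if __name__ == "__main__":
    for months in (7, 14, 21):
        print(months, "months:", projected_horizon_minutes(months), "minutes")

    # Assumption: one month of skilled human work is about 167 working hours.
    work_month_minutes = 167 * 60
    print("months until a one-work-month horizon:",
          round(months_until_horizon(work_month_minutes), 1))
```

Under these assumptions the horizon reaches roughly one working month in about four and a half years, consistent with the "within five years" estimate. The result is sensitive to both the doubling period and the definition of a working month.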
Still want to know more about what matters in AI right now?
Read this week’s issue of The Batch for in-depth analysis of news and research.
This week, Andrew Ng shared insights from AI Dev 25. He highlighted attendees’ strong interest in agentic AI and solving real-world problems over AGI hype. He also praised the event’s technical depth, emphasizing DeepLearning.AI’s “Learner First” mentality and the value of bringing developers together.
“With the wide range of AI tools now available, there is a rich set of opportunities for developers to build new things, but also a need for a neutral forum that helps developers do so.”
Read Andrew’s full letter here.
Other top AI news and research stories we covered in depth: Cohere’s Aya Vision outperformed multimodal rivals in text and image understanding, demonstrating fluency across a wide range of languages; AI Co-Scientist, Google’s new research agent, showed itself capable of generating hypotheses to aid drug discovery; the U.S. Copyright Office ruled that no new laws are needed to govern AI-generated works, noting the copyrightability of AI-assisted creations with sufficient human guidance; and MatterGen, a diffusion model, showcased its ability to design novel materials with tailored properties, advancing AI-driven material discovery.