In today’s edition of Data Points, you’ll learn more about:
- Stability AI’s limited wins in Getty copyright suit
- Kosmos’s new generalist scientific research agent
- German Commons, a big open dataset for training AI models
- Google’s experiments putting satellites with AI chips in space
But first:
Huge real-world datasets may establish new robotics scaling laws
Generalist AI introduced GEN-0, a class of embodied foundation models trained directly on physical interaction data that demonstrates predictable scaling laws similar to those in large language models. The company trained GEN-0 on over 270,000 hours of real-world manipulation data — orders of magnitude more than existing robotics datasets — and observed a phase transition at 7 billion parameters where smaller models exhibited ossification (inability to absorb new information) while larger models continued to improve. The models use a training approach that enables simultaneous thinking and acting by processing asynchronous streams of sensing and action tokens, and work across different robot configurations including six-, seven-, and 16+-degree-of-freedom semi-humanoid robots. The research demonstrates that pretraining data follows a power-law scaling relationship with downstream task performance, allowing researchers to predict how much data is needed to reach specific performance levels. (Generalist AI)
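The scaling claim lends itself to a small worked example. The sketch below fits a power law of the form L(D) = a * D^(-b) + c to made-up (data, loss) pairs and then inverts the fit to estimate how much data a target loss would require; the numbers and the exact functional form are illustrative assumptions, not GEN-0 measurements.

```python
# Toy power-law scaling fit: predict how much pretraining data a target
# downstream loss would require. All numbers are hypothetical, not GEN-0 data.
import numpy as np
from scipy.optimize import curve_fit

def power_law(hours, a, b, c):
    # Downstream validation loss as a function of pretraining hours: L(D) = a * D**(-b) + c
    return a * hours ** (-b) + c

# Hypothetical (hours of manipulation data, downstream validation loss) pairs.
hours = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 2.7e5])
loss = np.array([1.86, 1.41, 1.10, 0.92, 0.80, 0.74])

(a, b, c), _ = curve_fit(power_law, hours, loss, p0=[20.0, 0.4, 0.6])

# Invert the fit: how many hours of data would a target loss of 0.70 require?
target = 0.70
required_hours = (a / (target - c)) ** (1.0 / b)
print(f"fit: a={a:.2f}, b={b:.3f}, c={c:.3f}")
print(f"estimated data needed for loss {target}: {required_hours:,.0f} hours")
```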
Amazon releases Chronos-2, a universal forecasting model
Chronos-2 can forecast single time series, multiple related time series, and time series influenced by external factors, all without needing extra training. The model uses in-context learning and a group attention feature to understand how different time series relate to each other and to factor in outside influences like weather or sales promotions. Amazon trained Chronos-2 on synthetic data since real-world datasets with complex relationships between variables are hard to find. Chronos-2 beat existing forecasting models by wide margins on two major benchmarks, winning over 90 percent of head-to-head comparisons against its predecessor, Chronos-Bolt. The model’s weights are now openly available, and earlier versions have been downloaded over 600 million times from Hugging Face. (Amazon)
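For readers who want to try zero-shot forecasting themselves, here is a minimal sketch using the open-source chronos-forecasting package's pipeline interface with an earlier Chronos checkpoint; the Chronos-2 checkpoint name and its covariate-aware interface are assumptions and may differ from what Amazon ships.

```python
# Minimal zero-shot forecast with the chronos-forecasting package, shown with an
# earlier Chronos checkpoint. Chronos-2's weights and covariate handling may
# expose a different interface; treat the names below as placeholders.
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",  # placeholder checkpoint; swap in the Chronos-2 weights once published
    device_map="cpu",
    torch_dtype=torch.float32,
)

# Hypothetical monthly sales history: 48 points with trend plus yearly seasonality.
t = torch.arange(48, dtype=torch.float32)
context = 100.0 + 0.5 * t + 10.0 * torch.sin(2 * torch.pi * t / 12)

# Forecast the next 12 steps without fine-tuning; output shape is [batch, samples, horizon].
forecast = pipeline.predict(context, prediction_length=12)
low, median, high = torch.quantile(forecast[0], torch.tensor([0.1, 0.5, 0.9]), dim=0)
print(median)
```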
Stability AI wins limited copyright judgment in image scraping case
Getty Images largely lost its lawsuit against Stability AI in Britain’s High Court, though it narrowly won on trademark infringement claims. The image library company had accused Stability of scraping 12 million images from its website without permission to train the Stable Diffusion image generator, but Getty dropped its primary copyright claims during the trial and lost its secondary copyright arguments. The judge ruled that Stable Diffusion doesn’t infringe copyright because it doesn’t store or reproduce copyrighted works, but said Getty’s watermark appearing on some generated images constituted trademark infringement. Legal experts say the case leaves central questions about AI training and copyright unanswered, since Getty abandoned its main claims before the judge could rule on whether using copyrighted material to train AI models is lawful. Getty is pursuing a separate copyright lawsuit against Stability in U.S. federal court. (Associated Press)
Kosmos automates scientific research across multiple disciplines
Edison Scientific authors introduced Kosmos, an AI system that automates data-driven discovery by performing iterative cycles of literature search, data analysis, and hypothesis generation. Given a dataset and research objective, Kosmos writes an average of 42,000 lines of code and reads 1,500 scientific papers per run, nearly ten times more than previous systems. The authors listed seven discoveries, including identifying a clinically relevant mechanism of neuronal aging and generating statistical evidence that superoxide dismutase 2 may causally reduce myocardial fibrosis in humans. Expert evaluators found 79 percent of statements in Kosmos reports accurate, and 85 percent of statements based on data analysis reproducible, though the system showed limitations in interpretive statements. AI researchers in related fields may find Kosmos valuable since it demonstrates how structured world models can coordinate hundreds of agent rollouts to perform what experts estimated as more than six months of research work. (arXiv)
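The loop the paper describes, in which literature search, data analysis, and a shared world model feed one another across many cycles, can be sketched schematically. Every class and function below is an illustrative placeholder, not Edison Scientific's implementation.

```python
# Schematic outline of an iterative, Kosmos-style discovery loop. All classes
# and functions here are illustrative placeholders, not Edison Scientific's code.
from dataclasses import dataclass, field

@dataclass
class WorldModel:
    """Structured state shared across agent rollouts."""
    findings: list = field(default_factory=list)
    open_questions: list = field(default_factory=list)

def search_literature(question):
    """Placeholder: retrieve and summarize papers relevant to the question."""
    return [f"summary of papers about: {question}"]

def analyze_data(dataset, question):
    """Placeholder: write and execute analysis code against the dataset."""
    return {"question": question, "statistic": 0.0}

def propose_hypotheses(world):
    """Placeholder: turn recent findings into the next questions to test."""
    return [f"follow-up to: {f['question']}" for f in world.findings[-3:]]

def run_discovery_loop(dataset, objective, cycles=20):
    world = WorldModel(open_questions=[objective])
    for _ in range(cycles):
        if not world.open_questions:
            break
        question = world.open_questions.pop(0)
        world.findings.append({
            "question": question,
            "literature": search_literature(question),
            "analysis": analyze_data(dataset, question),
        })
        world.open_questions.extend(propose_hypotheses(world))
    return world.findings

findings = run_discovery_loop("expression_matrix.csv", "mechanisms of neuronal aging")
```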
Massive open corpus of German text developed for AI training
Researchers released the German Commons, the largest collection of openly licensed German text to date, comprising 154 billion tokens across 35.78 million documents from 40 institutional sources. The corpus draws from seven domains — web, political, legal, news, economic, cultural, and scientific — with every text carrying a verifiable license at least as permissive as CC-BY-SA 4.0. Processing included OCR-specific filtering for historical documents, deduplication, and removal of personal or toxic information. The release helps developers build German language models without the legal and ethical barriers posed by web crawls, providing commercially usable training data with verifiable provenance through document-level license metadata. The corpus and processing code are available on Hugging Face and GitHub. (arXiv)
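Because each document carries license metadata, downstream pipelines can filter at load time. Below is a minimal sketch using the Hugging Face datasets library; the dataset identifier and field names are assumptions, so check the dataset card for the actual values.

```python
# Sketch: stream German Commons and keep only documents whose license fits your
# use case. The dataset ID and column names are assumptions; consult the dataset
# card on Hugging Face for the real ones.
from datasets import load_dataset

# Stream to avoid downloading all 154 billion tokens up front.
ds = load_dataset("german-commons", split="train", streaming=True)  # hypothetical dataset ID

ALLOWED = {"CC-BY-4.0", "CC-BY-SA-4.0", "CC0-1.0"}  # adjust to your licensing requirements

def license_ok(example):
    # "license" is an assumed name for the document-level license field.
    return example.get("license") in ALLOWED

for doc in ds.filter(license_ok).take(3):
    print(doc.get("source"), doc.get("license"))
```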
Google tests AI infrastructure in space with solar-powered satellites
Google announced Project Suncatcher, a research initiative investigating whether constellations of solar-powered satellites equipped with TPUs could one day scale machine learning compute in space. The company published a preprint paper detailing early progress on challenges including high-bandwidth communication between satellites, orbital dynamics, and radiation effects on computing hardware. Google’s team achieved 1.6 terabits per second transmission in a bench-scale demonstration and found that Trillium TPUs withstood radiation levels nearly three times higher than expected five-year mission doses. The research suggests that if launch costs continue declining to around $200 per kilogram by the mid-2030s, space-based data centers could become economically comparable on a per-kilowatt basis to Earth-bound facilities. Google plans to launch two prototype satellites in partnership with Planet by early 2027 to test the concepts in orbit. (Google)
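The per-kilowatt claim comes down to simple arithmetic: launch cost per kilogram times satellite mass per kilowatt, amortized over the hardware's lifetime, compared against what a terrestrial facility spends per kilowatt-year on power and cooling. The sketch below runs that arithmetic using the $200-per-kilogram figure from the announcement; every other number is a hypothetical input for illustration, not Google's analysis.

```python
# Back-of-the-envelope launch-cost comparison. The $200/kg price comes from the
# announcement; all other inputs are hypothetical placeholders.
launch_cost_per_kg = 200.0       # USD per kg, projected mid-2030s launch price
satellite_mass_per_kw = 10.0     # kg of satellite per kW of compute (hypothetical)
mission_lifetime_years = 5.0     # amortization period (hypothetical)

space_cost_per_kw_year = launch_cost_per_kg * satellite_mass_per_kw / mission_lifetime_years

terrestrial_cost_per_kw_year = 500.0  # USD per kW-year for power and cooling (hypothetical)

print(f"amortized launch cost: ${space_cost_per_kw_year:,.0f} per kW-year")
print(f"terrestrial power and cooling: ${terrestrial_cost_per_kw_year:,.0f} per kW-year")
```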
A special offer for our community
DeepLearning.AI just launched the first-ever subscription plan for our entire course catalog! As a Pro Member, you’ll immediately enjoy access to:
- Over 150 AI courses and specializations from Andrew Ng and industry experts
- Labs and quizzes to test your knowledge
- Projects to share with employers
- Certificates to testify to your new skills
- A community to help you advance at the speed of AI
Enroll now to lock in a year of full access for $25 per month paid upfront, or opt for month-to-month payments at just $30 per month. Both payment options begin with a one-week free trial. Explore Pro’s benefits and start building today!
Still want to know more about what matters in AI right now?
Read this week’s issue of The Batch for in-depth analysis of news and research.
This week, Andrew Ng talked about the importance of controlling your own data to leverage AI agents effectively, challenges posed by SaaS vendors creating data silos, and the increasing value of organized unstructured data.
“Because of AI’s growing capabilities, the value you can now create from ‘connecting the dots’ between different pieces of data is higher than ever. For example, if an email click is logged in one vendor’s system and a subsequent online purchase is logged in a different one, then it is valuable to build agents that can access both of these data sources to see how they correlate to make better decisions.”
Read Andrew’s full letter here.
Other top AI news and research stories we covered in depth:
- OpenAI has completed a restructuring, a significant milestone that frees it to go public and strike deals with new partners.
- MiniMax-M2 emerges as a leader in open-weights coding, offering top performance with a lightweight footprint and low costs.
- Universal Music Group and music generator Udio have struck a deal to settle a lawsuit and build a new platform to remix copyrighted music, signaling a new embrace of AI by the music industry.
- Google researchers released VaultGemma, an open-weights model trained with differential privacy to keep personal information in its training data from being memorized or reproduced.