In today’s edition, you’ll learn more about:
- Microsoft’s Phi-4 updates add reasoning to small models
- OpenAI rolls back cloying update to GPT-4o, explains what went wrong
- Amazon debuts its largest multimodal teacher/agent model yet
- Meta partners with fast inference providers for new official API
But first:
Alibaba debuts Qwen3 language models with hybrid reasoning
Alibaba released Qwen3, a new family of large language models that support 119 languages and dialects. The family includes the flagship Qwen3-235B-A22B, a mixture-of-experts model with 235 billion total parameters (22 billion active per token), a smaller Qwen3-30B-A3B mixture-of-experts model (3 billion active), and six dense models of various sizes. The models feature a hybrid approach that lets users toggle between a deliberate “thinking mode” for complex reasoning and a faster “non-thinking mode” for simpler queries. Qwen3 models were trained on 36 trillion tokens, nearly double the training data of their predecessor, and, according to Alibaba’s benchmarks, outperform competitors like DeepSeek-R1 and Gemini-2.5-Pro in coding, math, and other capabilities. All Qwen3 models are open-weights and immediately available under the Apache 2.0 license on platforms including Hugging Face, ModelScope, and Kaggle. (Qwen Blog / GitHub)
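The thinking-mode toggle is exposed through the chat template. Here’s a minimal sketch using Hugging Face transformers, based on the usage shown in Qwen’s model cards; verify the `enable_thinking` flag against the current Qwen3 docs before relying on it.

```python
# Minimal sketch of toggling Qwen3's hybrid reasoning via transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-30B-A3B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype="auto", device_map="auto"
)

messages = [{"role": "user", "content": "How many primes are there below 100?"}]

# enable_thinking=True inserts the reasoning scaffold into the prompt;
# set it to False for the faster non-thinking mode on simple queries.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,
)
inputs = tokenizer([text], return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[-1]:], skip_special_tokens=True))
```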
Study claims Chatbot Arena gave big tech companies unfair advantages
Authors from Cohere, Stanford, MIT, and Ai2 accused LM Arena of allowing select AI companies to privately test multiple model variants on its Chatbot Arena benchmark while only publishing scores for their best performers. The paper alleges that companies including Meta, OpenAI, Google, and Amazon received preferential treatment that helped them achieve higher leaderboard rankings than competitors who weren’t offered the same opportunity. According to the study, Meta tested 27 model variants privately before its Llama 4 release but only publicly revealed the score for its top-performing model. Chatbot Arena has disputed these claims, calling the study full of “inaccuracies” and “questionable analysis,” while maintaining that its leaderboard is committed to fair evaluations and that all model providers are welcome to submit additional models for testing. (arXiv / TechCrunch)
Microsoft releases new Phi-4 reasoning models
Microsoft launched three new language models: Phi-4-reasoning, Phi-4-reasoning-plus, and Phi-4-mini-reasoning. The 14-billion-parameter Phi-4-reasoning-plus outperforms larger competitors on mathematical problems and scientific questions, including beating DeepSeek-R1 (671 billion parameters) on the 2025 USA Math Olympiad qualifier test. The open-weights models are available on Azure AI Foundry and Hugging Face, with versions for Copilot+ PCs planned for a future release. (Microsoft)
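Since the weights are on Hugging Face, trying the model locally is a one-pipeline affair. This is a sketch; the repo id `microsoft/Phi-4-reasoning-plus` is assumed from Microsoft’s naming and should be confirmed on the Hub.

```python
# Sketch: load Phi-4-reasoning-plus from Hugging Face and run a chat prompt.
import torch
from transformers import pipeline

pipe = pipeline(
    "text-generation",
    model="microsoft/Phi-4-reasoning-plus",  # assumed repo id; check the Hub
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

messages = [{"role": "user", "content": "Prove that the square root of 2 is irrational."}]

# Reasoning models emit long chains of thought, so allow a generous budget.
out = pipe(messages, max_new_tokens=2048)
print(out[0]["generated_text"][-1]["content"])  # last message is the reply
```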
OpenAI rolls back sycophantic GPT-4o update
OpenAI reverted an April 25th update to GPT-4o that made the model excessively agreeable, particularly when validating users’ negative emotions. The update combined several changes that weakened the model’s primary reward signal, including a new signal based on user feedback that likely amplified this behavior. Despite positive results in offline evaluations and limited A/B testing, the company failed to adequately weigh qualitative concerns from expert testers who noticed the model’s behavior “felt slightly off.” OpenAI says it has implemented several new safeguards, including treating model behavior issues as launch-blocking concerns, introducing an “alpha” testing phase, and committing to more proactive communication about model updates. (OpenAI)
Amazon releases Nova Premier multimodal model
Amazon Web Services made Nova Premier generally available in Amazon Bedrock, adding to its existing Nova model family. Nova Premier, billed as Amazon’s largest model (total parameter count undisclosed), accepts text, image, and video input with a one-million-token context window and outputs text. AWS benchmarked Nova Premier on 17 metrics; it outperformed the other Nova models and matched competitors like Claude 3.7 Sonnet and GPT-4.5 on about half of the evaluations. Developers can use Nova Premier as a teacher model, distilling its capabilities into smaller, faster models, or pair it with those smaller models in agentic workflows. Nova Premier is now available in three AWS regions at $2.50 per million input tokens and $12.50 per million output tokens. (Amazon)
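For developers, Nova Premier is reachable through Bedrock’s standard Converse API. Below is a minimal boto3 sketch; the inference-profile id and region are assumptions to verify in your AWS console.

```python
# Sketch: call Nova Premier via Amazon Bedrock's Converse API using boto3.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-east-1")  # assumed region

response = client.converse(
    modelId="us.amazon.nova-premier-v1:0",  # assumed inference profile id
    messages=[
        {"role": "user", "content": [{"text": "Summarize the key risks in this contract clause: ..."}]}
    ],
    inferenceConfig={"maxTokens": 512, "temperature": 0.3},
)
print(response["output"]["message"]["content"][0]["text"])
```

At the listed rates, a call with 100,000 input tokens and 1,000 output tokens would cost about $0.25 for input plus $0.0125 for output, roughly $0.26 total.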
Meta launches Llama API with one-click key creation and model playgrounds
Meta announced a limited free preview of a new Llama API. Meta’s new developer site offers one-click API key creation, interactive playgrounds for exploring Llama models, and tools for fine-tuning and evaluating custom versions of the company’s Llama 3.3 8B model. Meta emphasized that user prompts and responses won’t be used to train its AI models, and developers can export their custom models rather than being locked to Meta’s servers. The company also announced collaborations with Cerebras and Groq for faster inference speeds; access to Llama 4 models served by these providers is available by request. (Meta)
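A hypothetical sketch of what a call might look like, assuming the preview exposes an OpenAI-compatible endpoint (a common pattern among inference providers). The base URL and model id below are placeholders, not confirmed values from the announcement; check Meta’s developer docs.

```python
# Hypothetical sketch of calling the Llama API preview via the OpenAI SDK.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_LLAMA_API_KEY",                 # created on Meta's developer site
    base_url="https://api.llama.example/v1/",     # placeholder; check Meta's docs
)

resp = client.chat.completions.create(
    model="llama-4-maverick",  # placeholder model id
    messages=[{"role": "user", "content": "Give me one sentence about llamas."}],
)
print(resp.choices[0].message.content)
```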
Still want to know more about what matters in AI right now?
Read this week’s issue of The Batch for in-depth analysis of news and research.
This week, Andrew Ng highlighted an inspiring story of a high school basketball coach who learned to code and now teaches computer science, emphasizing how AI can help scale K-12 education by empowering both students and teachers.
“Starting from K-12, we should teach every student AI-enabled coding, since this will enable them to become more productive and more empowered adults. But there is a huge shortage of computer science (CS) teachers… Whereas AI can directly deliver personalized advice to students, the fact that it is now helping teachers also deliver personalized support will really help in K-12.”
Read Andrew’s full letter here.
Other top AI news and research stories we covered in depth: OpenAI launched API access to GPT Image 1, the image generator behind viral ChatGPT uploads; Google updated its AI-powered music generation tools, targeting professional musicians and creators; CB Insights’ Top 100 AI Startups list identified emerging players focused on AI agents and infrastructure; and researchers showed how large language models can improve shopping recommendations by inferring customer preferences from natural language input.