How top models perform on a challenging new benchmark: Plus, OpenAI’s latest agreements with news publishers

Jun 3, 2024
Reading time: 4 min read

This week's top AI stories feature a new model from Mistral designed for coding, Microsoft’s updated Phi-3 family of small language models, OpenAI’s new safety and security team, and a family of models from Cohere that supports 23 global languages:

Codestral is Mistral’s open-weight model for code
Codestral is a 22-billion-parameter model trained on English and over 80 programming languages, including Python, Java, C++, Fortran, and Swift. Codestral handily outperforms CodeLlama 70B on multiple benchmarks, including HumanEval and RepoBench, and is competitive with DeepSeek Coder 33B. The model is open-weight, but available for download only under a noncommercial use license, limiting its incorporation into other software. (Mistral)

OpenAI signs agreements with News Corp., Vox Media, and The Atlantic
The multiyear partnerships give OpenAI’s models and ChatGPT access to large, regularly updated sources of news and opinion that they can display in response to user queries, along with attribution and links to full articles. The Vox and Atlantic deals also give the publishers access to OpenAI’s technologies to develop their own experimental AI and data products. OpenAI’s deals follow similar ones with Reddit, Stack Overflow, Le Monde, and other social media sites and news sources, as well as those made by Google and other AI companies. (The Wall Street Journal)

MMLU-Pro: A more challenging benchmark for large language models
MMLU-Pro is a new dataset from TIGER-Lab that aims to more rigorously test the capabilities of large language models across various disciplines. It builds upon the original MMLU dataset but increases the number of answer options to 10, incorporates more reasoning-focused problems, and adds over 5,000 new questions sourced from STEM websites, TheoremQA, and SciBench. GPT-4o remains at the top of the MMLU-Pro leaderboard, followed by Claude 3 Opus and Gemini 1.5 Flash, but some models, such as Mixtral-8x7B, saw their scores drop by over 30 percent on the new benchmark. (Hugging Face)

Microsoft’s Phi-3 small language models now generally available
Microsoft announced the addition of Phi-3-Vision, a 4.2-billion-parameter multimodal model combining language and vision capabilities, to its Phi-3 family of small, open models. The company also made Phi-3-Small and Phi-3-Medium available on Microsoft Azure, while Phi-3-Mini and Phi-3-Medium are now accessible through Azure AI’s models-as-a-service offering. Phi-3-Silica is a separate model in the family that powers AI features on Windows’ new Copilot+ PCs; familiarity with the Phi family may help Windows developers looking to add these features to their applications. (Microsoft)

Cohere releases Aya 23, an open-weight multilingual language model
Building on Cohere’s Command and Aya 101 models, Aya 23 covers 23 European and Asian languages, including Arabic, Chinese (simplified & traditional), Hebrew, Hindi, Indonesian, Japanese, Korean, Persian, Turkish, and Vietnamese. Unlike Aya 101, which attempted breadth of coverage with 101 languages, Aya 23 aims to balance breadth and depth, outperforming Aya 101 and other open models like Gemma and Mistral on a wide range of generative and reasoning tasks. Cohere has made 8 billion and 35 billion parameter versions of the model available for noncommercial use in an attempt to further global research and development of massively multilingual models. (Cohere for AI)

OpenAI establishes safety team amid concerns from departing researchers
OpenAI’s new Safety and Security Committee, led by CEO Sam Altman and board members Adam D’Angelo, Nicole Seligman, and Bret Taylor, will address critical safety and security decisions for the company’s projects and operations. The committee, which will also include a range of technical and policy experts, will take 90 days to evaluate OpenAI’s processes and safeguards, presenting its findings to the board for implementation. The safety committee’s formation comes after the departure of several key researchers, including co-founder and chief scientist Ilya Sutskever and Superalignment team co-leader Jan Leike, who expressed concerns about safety taking a backseat to product development at OpenAI. (OpenAI)

Still want to know more about what matters in AI right now? 

Read this week’s issue of The Batch for in-depth analysis of news and research.

This week, Andrew Ng discussed why we need better evals for LLM applications:

“The cost of running evals poses an additional challenge. Let’s say you’re using an LLM that costs $10 per million input tokens, and a typical query has 1000 tokens. Each user query therefore costs only $0.01. However, if you iteratively work to improve your algorithm based on 1000 test examples, and if in a single day you evaluate 20 ideas, then your cost will be 20*1000*0.01 = $200. For many projects I’ve worked on, the development costs were fairly negligible until we started doing evals, whereupon the costs suddenly increased. (If the product turned out to be successful, then costs increased even more at deployment, but that was something we were happy to see!)”

Read Andrew's full letter here.
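
For readers who want to plug in their own numbers, the arithmetic in the quoted passage can be sketched in a few lines of Python. This is only an illustration, not code from the letter: the function name eval_cost_usd and its parameter names are made up here, and the sketch counts input tokens only, as the quote does.

```python
def eval_cost_usd(price_per_million_tokens, tokens_per_query, num_examples, num_ideas):
    """Estimate the input-token cost of an eval run (hypothetical helper).

    Every idea is evaluated on every test example, and each evaluation
    sends one query of `tokens_per_query` input tokens.
    """
    total_tokens = tokens_per_query * num_examples * num_ideas
    return price_per_million_tokens * total_tokens / 1_000_000


# Figures taken from the quoted passage: $10 per million input tokens,
# 1000 tokens per query, 1000 test examples, 20 ideas evaluated in a day.
cost_per_query = eval_cost_usd(10, 1000, num_examples=1, num_ideas=1)
cost_per_day = eval_cost_usd(10, 1000, num_examples=1000, num_ideas=20)

print(f"Cost per query: ${cost_per_query:.2f}")
print(f"Cost per day of evals: ${cost_per_day:.2f}")
```

With the quote's figures this reproduces $0.01 per query and $200 per day. Output tokens, retries, and any model-graded scoring would add to the total and are left out of this sketch.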

Other top AI news and research stories we covered in depth included a deep learning model that significantly reduced deaths among critically ill hospital patients, the Indian startups that are testing autonomous vehicles on their nation’s disorderly local roads, a new report from Microsoft and LinkedIn on knowledge workers’ adoption of AI, and RAPTOR, a recursive summarizer and retrieval system for LLMs.

