Microsoft delays Recall Plus, Nvidia leads new MLPerf benchmarks

Published: Jun 17, 2024

This week’s top AI stories include:

• Personalized image generation
• A second look at the MMLU benchmark
• A new method to reduce LLM hallucinations
• A closer look inside Apple’s AI cloud security

But first:

Microsoft delays Recall feature for new Copilot Plus PCs
Instead of launching the feature with the new PCs, Microsoft will use the Windows Insider software preview community to thoroughly test Recall and ensure it meets quality and security standards before making it widely available. Recall employs on-device AI models integrated into Windows 11 to capture screenshots of nearly all user activities and provide a searchable database for users to find previously viewed content. However, initial versions of the database stored the material in insecure plaintext, raising concerns from privacy advocates and security experts about potential cybersecurity risks associated with the feature. (The Verge)

MLCommons announces MLPerf Training v4.0 results with new benchmarks
The MLPerf suite introduces two new benchmarks: LoRA fine-tuning of Llama 2 70B and a graph neural network (GNN) benchmark for node classification. The LoRA benchmark measures techniques to reduce computational costs for fine-tuning large language models, while the GNN benchmark measures performance on graph-structured data used in social network analysis, fraud detection, and other applications. Unsurprisingly, Nvidia’s H100 chips lead the 205 performance results submitted by 17 organizations, with Google’s TPUs just behind. (MLCommons)
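
To make the cost reduction concrete, here is a rough sketch of the arithmetic behind LoRA: instead of updating a full weight matrix, it trains two small low-rank factors. The layer size and rank below are hypothetical values chosen for illustration, not figures from the MLPerf benchmark.

```python
def full_finetune_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix W (d_out x d_in) is updated."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters for a LoRA update W + B @ A,
    where B is (d_out x rank) and A is (rank x d_in)."""
    return rank * (d_out + d_in)


# Hypothetical layer size and rank, for illustration only.
d_out, d_in, rank = 8192, 8192, 16

full = full_finetune_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(f"full fine-tuning: {full:,} trainable parameters")
print(f"LoRA (rank {rank}): {lora:,} trainable parameters")
print(f"LoRA trains {lora / full:.2%} of the layer's parameters")
```

The smaller the rank relative to the layer dimensions, the fewer parameters need gradients and optimizer state, which is where most of the savings come from.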

MMLU-Redux: Identifying and correcting errors in the MMLU dataset
Researchers identified numerous errors in the popular Massive Multitask Language Understanding (MMLU) benchmark dataset, which is used to evaluate the performance of large language models (LLMs). To correct these errors, they manually re-annotated 3,000 questions across 30 subsets of MMLU, creating MMLU-Redux. Re-evaluating leading LLMs on MMLU-Redux revealed notable changes in their performance metrics and rankings, highlighting the impact of dataset errors on model evaluation. Correcting the virology subset produced the largest changes in the metrics, with many models going from 50 percent accuracy to over 90 percent accuracy, and the Palmyra X v3 model going from fourth place to first. (arXiv)

Lamini introduces Memory Tuning to reduce hallucinations and improve factual accuracy
By tuning millions of expert LoRA adapters with precise facts on top of open-source LLMs, Memory Tuning enabled 95 percent accuracy on critical use cases where previous approaches peaked at 50 percent. The resulting sparsely activated Mixture of Memory Experts (MoME) model allows for an extremely high number of parameters and facts to be learned, while keeping computational cost fixed at inference time. This method allows companies to automate tasks with higher precision, lower costs, and faster development cycles compared to traditional fine-tuning methods. (Lamini)
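
The idea of sparse activation can be sketched in a few lines. This is not Lamini's implementation: the dot-product routing, the rank-1 adapters, and every name below are assumptions made for illustration. The point is only that the number of adapters applied per query stays at k no matter how many are stored.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def select_experts(query, keys, k):
    """Indices of the k experts whose keys score highest against the query.
    A full scan is used here for simplicity; a large system would presumably
    need an index rather than scoring every key."""
    ranked = sorted(range(len(keys)), key=lambda i: dot(query, keys[i]), reverse=True)
    return ranked[:k]


def forward(x, keys, adapters, k):
    """Apply only the k selected rank-1 adapters.
    Returns the output vector and the number of adapters that ran."""
    chosen = select_experts(x, keys, k)
    out = list(x)
    for i in chosen:
        a, b = adapters[i]
        scale = dot(a, x)  # project the input onto the adapter's input direction
        out = [o + scale * bj for o, bj in zip(out, b)]
    return out, len(chosen)


def make_bank(n, dim):
    """Build a hypothetical bank of n random routing keys and rank-1 adapters."""
    keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n)]
    adapters = [
        ([random.gauss(0, 1) for _ in range(dim)],
         [random.gauss(0, 1) for _ in range(dim)])
        for _ in range(n)
    ]
    return keys, adapters


random.seed(0)
dim, k = 4, 2
x = [1.0, 0.5, -0.5, 2.0]

for n in (10, 1000):
    keys, adapters = make_bank(n, dim)
    _, applied = forward(x, keys, adapters, k)
    print(f"{n} experts in the bank, {applied} applied to this input")
```

Growing the bank adds capacity for more facts, while the adapter computation per input is bounded by k.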

Midjourney adds personalized image generation
Midjourney now allows users to create personalized images by ranking image pairs on its website. By adding the --p or --personalize parameter to prompts, the AI will generate images tailored to the user’s preferences as determined by their pair rankings. Users can apply their own personalization by default or use another user’s by including their shortcode, and adjust the amount of personalization with the --stylize parameter. Personalization continues a trend in AI development enabling a personal or house-defined style for automatically generated content. (Midjourney)

Apple unveils Private Cloud Compute for secure cloud-based AI
Private Cloud Compute (PCC) processes user data in the cloud without exposing it to anyone, including Apple, and deletes the data after completing the task. The system uses custom hardware, a hardened operating system, and various security measures to protect user data and enable independent security researchers to verify its privacy claims. Apple plans to make PCC software images publicly available for security research within 90 days of inclusion in its transparency log, allowing researchers to inspect the software, verify its functionality, and identify potential issues. Apple clearly aims to compete in AI on cloud security and user privacy; it remains to be seen how other technology companies will respond. (Apple)

Still want to know more about what matters in AI right now?

If you missed it, read last week’s issue of The Batch for in-depth analysis of news and research.

This week, Andrew Ng discussed agentic design and inclusive work in the AI community:

“More and more people are building systems that prompt a large language model multiple times using agent-like design patterns. But there’s a gray zone between what clearly is not an agent (prompting a model once) and what clearly is (say, an autonomous agent that, given high-level instructions, plans, uses tools, and carries out multiple, iterative steps of processing). Rather than arguing over which work to include or exclude as being a true agent, we can acknowledge that there are different degrees to which systems can be agentic. Then we can more easily include everyone who wants to work on agentic systems.”

Read Andrew's full letter here.

Other top AI news and research stories we covered in depth included Apple’s generative AI strategy, Stability AI’s enhanced text-to-audio generator, the results from the AI Seoul Summit and the AI Global Forum, and Google’s AMIE, a chatbot that outperformed doctors in diagnostic conversations.

Subscribe to Data Points