Evaluating the best AI search engines, Claude can read your Gmail and the web

Published Apr 21, 2025 · 4 min read
[Image: Crowd at sunset concert cheering for a giant server on stage, symbolizing AI as a modern rockstar.]

In today’s edition, you’ll learn more about:

  • New BitNet model shows that 1, 0, and -1 might be enough
  • Kling gets a model update with image and video inputs
  • Google rolls out video generation and animation model to subscribers
  • Music streamer Deezer notes sharp uptick in AI-generated music

But first:

Search Arena leaderboard weighs human preferences for AI-aided search

Search Arena, a new crowdsourced evaluation platform from LM Arena, measures human preference for search-augmented LLM systems using real-world queries and current events. Based on 7,000 human votes collected between March and April, Gemini-2.5-Pro-Grounding and Perplexity-Sonar-Reasoning-Pro tied for first place on the leaderboard, followed by other Perplexity Sonar models, Gemini-2.0-Flash-Grounding, and OpenAI’s web search API models. Analysis showed that three factors strongly correlated with human preference: longer responses, higher citation counts, and references to specific web sources like YouTube and online forums. The authors have open sourced their dataset and analysis code, with plans to expand the platform to include more model submissions and cross-task evaluations. (LM Arena)
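Arena-style leaderboards are typically built by fitting a Bradley-Terry model to pairwise human votes, which turns "model A beat model B" counts into per-model strength scores. The sketch below is illustrative only: the function name and the simple minorization-maximization update are my own framing, not LM Arena's released analysis code.

```python
import numpy as np

def bradley_terry_scores(wins: np.ndarray, n_iter: int = 200) -> np.ndarray:
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times model i beat model j.
    Returns normalized strength scores (higher = preferred more often).
    """
    n = wins.shape[0]
    p = np.ones(n)
    for _ in range(n_iter):
        for i in range(n):
            total_wins = wins[i].sum()
            # MM update: divide wins by the "expected exposure" against rivals
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            if denom > 0:
                p[i] = total_wins / denom
        p /= p.sum()  # normalize so scores sum to 1 (identifiability)
    return p

# Toy example: model 0 wins 8 of 10 head-to-head votes against model 1
wins = np.array([[0, 8], [2, 0]])
scores = bradley_terry_scores(wins)
```

Under this model, a score gap maps to a predicted win probability, which is why a few thousand votes suffice to separate the top systems.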

Anthropic partners with Google on Research and Docs integration

Anthropic introduced two new features for Claude, its AI chatbot, both powered by Google, a key investor. Research allows Claude to search both internal work documents and the web, conducting multiple searches and automatically exploring different angles of a question to deliver answers with citations. The Google Workspace integration connects Claude to Gmail, Calendar, and Google Docs, enabling it to search emails, review documents, and access calendar information without requiring manual uploads. These features bring Claude to parity with competitors such as OpenAI that offer Deep Research capabilities. Both are now available in early beta for paid plans in the United States, Japan, and Brazil, with the Google Workspace integration accessible to all paid users whose admins have enabled the feature. (Anthropic)

1.58-bit language model promises full power at a fraction of the cost

Microsoft released BitNet b1.58 2B4T, a native 1.58-bit large language model trained on 4 trillion tokens. The model matches the performance of similar-sized full-precision models across language understanding, math reasoning, coding, and conversational tasks, while dramatically reducing resource requirements. BitNet b1.58 uses just 0.4GB of memory compared to 2-4.8GB for comparable models, consumes up to 90 percent less energy, and offers faster inference speeds. Microsoft has made the model weights publicly available on Hugging Face along with optimized inference implementations for both GPU and CPU architectures. (arXiv)
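The "1.58-bit" figure comes from restricting each weight to the three values {-1, 0, +1} (log2(3) ≈ 1.58 bits). The BitNet papers describe an "absmean" quantization that scales a weight tensor by its mean absolute value before rounding; this NumPy snippet is a minimal sketch of that idea, not Microsoft's released implementation.

```python
import numpy as np

def absmean_ternary_quantize(w: np.ndarray, eps: float = 1e-6):
    """Quantize a weight tensor to {-1, 0, +1} via absmean scaling,
    as described for BitNet b1.58. Returns (ternary weights, scale)."""
    scale = np.abs(w).mean() + eps              # per-tensor absmean scale
    w_q = np.clip(np.round(w / scale), -1, 1)   # round, then clamp to ternary
    return w_q.astype(np.int8), float(scale)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 4)).astype(np.float32)
w_q, scale = absmean_ternary_quantize(w)
# Approximate reconstruction: w ≈ w_q * scale
```

Because the weights are ternary, matrix multiplies reduce to additions and subtractions plus one scale factor, which is where the memory and energy savings come from.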

Kling 2.0 adds multimodal inputs, improves video creation

Kuaishou Technology launched Kling AI 2.0 Master Edition, featuring a new multimodal visual language (MVL) approach that allows users to input images, video clips, and text rather than text alone. The company claims its models outperform competitors like Google Veo2 and Runway Gen-4 in internal tests, with significant advantages in semantic responsiveness, visual quality, and motion quality. The new model introduces editing capabilities that let users add, remove, or replace elements in AI-generated videos by inputting images or text prompts. Subscription plans start at $10 a month for limited credits and range up to $92 a month for professional users. (Kling AI and Globe Newswire)

Google launches Veo 2 and Whisk for Gemini Advanced users

Google rolled out Veo 2, its updated video generation model, to U.S.-based Gemini Advanced users. Veo 2 enables users to create videos by providing detailed scene descriptions, with more specific prompts offering greater control over the final output. Whisk, a Google Labs experiment introduced in December, helps users visualize ideas using text and image prompts, and now includes Whisk Animate to turn images into videos using Veo 2. All generated videos include SynthID watermarking, and Google has implemented safety measures including red teaming and evaluations to prevent policy-violating content. The feature is now rolling out globally to Google One AI Premium subscribers across all Gemini-supported languages. (Google)

Music streaming service Deezer swamped with AI songs

Deezer revealed that 18 percent of songs uploaded to its platform are fully generated by AI, with more than 20,000 AI-generated tracks uploaded daily, nearly twice the amount reported four months ago. The French streaming service implemented a detection tool to filter these AI-created tracks from algorithmic recommendations for its 9.7 million subscribers. This surge in AI-generated music has triggered legal battles across the creative industry, with major labels like Universal, Warner, and Sony suing AI music tools Suno and Udio for alleged copyright infringement. (Reuters)


Still want to know more about what matters in AI right now?

Read last week’s issue of The Batch for in-depth analysis of news and research.

Last week, Andrew Ng shared why teams should have started building evaluations early — even if they were quick and imperfect — and improved them over time to accelerate GenAI development.

“It’s okay to build quick evals that are only partial, incomplete, and noisy measures of the system’s performance, and to iteratively improve them.”
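In that spirit, a "quick and imperfect" eval can be as small as a handful of prompts scored by substring matching. The sketch below is a hypothetical illustration of the idea, not code from Andrew's letter; `run_eval`, `CASES`, and the stub model are all placeholder names.

```python
# A deliberately crude eval: partial, incomplete, and noisy -- but
# good enough to track whether a system is improving between iterations.
CASES = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
]

def run_eval(model_fn, cases=CASES) -> float:
    """Score a model function by case-insensitive substring matching."""
    hits = sum(expected.lower() in model_fn(prompt).lower()
               for prompt, expected in cases)
    return hits / len(cases)

# Example with a stub "model" standing in for a real LLM call:
score = run_eval(lambda p: "Paris is the capital" if "France" in p else "4")
# → 1.0
```

The point is not the scoring rule but the loop: a noisy metric you can run today beats a perfect one you never finish building.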

Read Andrew’s full letter here.

Other top AI news and research stories we covered in depth: Google unveiled Gemini 2.5 Pro Experimental, which outperforms top AI models and continues the rapid evolution of its flagship model family; Model Context Protocol (MCP), an open standard for tool use and data access, gained traction as OpenAI adopted it to improve LLM integration with external tools and APIs; a book excerpt explored Sam Altman’s brief ouster and return to OpenAI, shedding light on the company’s internal power struggles; and researchers introduced a new byte-based model that surpasses Llama 3 and other token-based models on tasks involving misspellings, noisy input, and translation.


Subscribe to Data Points


Your accelerated guide to AI news and research