Adapting R1-like techniques to video reasoning, Anthropic builds an “AI microscope” to probe Claude’s internal anatomy

Published: Mar 31, 2025
Reading time: 4 min read

Twice a week, Data Points brings you the latest AI news, tools, models, and research in brief. In today’s edition, you’ll find:

  • How Alibaba built its compact but powerful video generation models
  • Towards a unified text-image diffusion model
  • A new approach to vision-language understanding from Alibaba
  • Microsoft adapts OpenAI models to build data workforce agents

But first:

New approach to reinforcement learning boosts video understanding

Researchers at CUHK and other institutions created Video-R1, a new fully open source AI model designed to improve video reasoning capabilities in multimodal large language models through reinforcement learning. The team created two new datasets combining both image and video data for training and developed T-GRPO, a novel training algorithm that encourages temporal reasoning by comparing model performance on ordered versus shuffled video frames. At seven billion parameters, their Video-R1-7B model achieves state-of-the-art performance across multiple video reasoning benchmarks, notably reaching 35.8 percent accuracy on the VSI-Bench spatial reasoning test, surpassing GPT-4o. (arXiv and GitHub)
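
To make the T-GRPO idea concrete, here is a minimal sketch of the ordered-versus-shuffled comparison it relies on. This is an illustration, not the authors’ implementation: `model.answer` is a hypothetical stochastic rollout interface, and the group size and bonus value are placeholders.

```python
import random

def t_grpo_style_rewards(model, frames, question, answer, group_size=8, bonus=0.1):
    """Sketch of T-GRPO's ordered-vs-shuffled reward comparison.

    `model.answer` is assumed to sample an answer stochastically given
    video frames and a question (hypothetical interface)."""
    shuffled = frames[:]
    random.shuffle(shuffled)

    # Sample a group of rollouts on temporally ordered frames and on a
    # shuffled copy of the same frames.
    ordered_correct = [model.answer(frames, question) == answer for _ in range(group_size)]
    shuffled_correct = [model.answer(shuffled, question) == answer for _ in range(group_size)]

    # Base reward: 1 for each correct rollout on ordered frames.
    rewards = [1.0 if c else 0.0 for c in ordered_correct]

    # Grant an extra temporal bonus only when the group does better with
    # frames in order, nudging the policy to exploit temporal structure.
    if sum(ordered_correct) > sum(shuffled_correct):
        rewards = [r + bonus if c else r for r, c in zip(rewards, ordered_correct)]
    return rewards
```

In a GRPO-style setup, rewards like these would then be normalized within the group to form advantages for the policy update.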

Language model mysteries revealed: How Claude thinks and plans

Anthropic researchers used new interpretability techniques modeled on laboratory biology to examine how Claude processes information internally. By conducting experiments that modified Claude’s internal states, the team discovered that Claude plans ahead when writing poetry, uses parallel processing paths for mental math, and operates in a shared conceptual space across different languages. Although these methods capture only part of the total computation happening inside LLMs, the findings could help researchers better understand how AI systems work and could lead to more reliable and transparent models. (Anthropic)
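
For readers who want a feel for this kind of experiment, the toy example below edits an open model’s internal activations mid-forward-pass with a PyTorch hook. It illustrates the general technique of perturbing internal states, not Anthropic’s tooling or Claude’s architecture.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Generic activation-intervention demo on an open model (not Anthropic's
# methods): zero the hidden states leaving one transformer block and
# observe how the model's output changes.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def zero_block_output(module, inputs, output):
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden * 0.0  # crude edit of the internal state
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

handle = model.transformer.h[5].register_forward_hook(zero_block_output)
ids = tok("The capital of France is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=5)[0]))
handle.remove()  # detach the hook to restore normal behavior
```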

Alibaba launches powerful video generation model with open weights

Alibaba Group released its technical report for Wan2.1, a suite of video and audio generation models available under an Apache 2.0 license. Wan2.1’s 1.3 billion parameter version requires only 8.19 GB of VRAM and can generate 5-second 480P videos in about 4 minutes on consumer GPUs. Its 14 billion parameter version shows strong capabilities in text-to-video, image-to-video, and video editing, as well as the novel ability to generate text within videos in both Chinese and English. The paper details Wan2.1’s complete technical architecture, from its VAE and DiT model designs to training methods, data preparation, and performance optimization strategies. It also shows that Wan2.1 outperforms Runway and unspecified closed models on multiple benchmarks, including image-to-video and text-to-video evaluations. (arXiv and GitHub)

Novel discrete diffusion model unifies text and image generation

Researchers at Carnegie Mellon developed UniDisc, a new multimodal architecture that applies discrete diffusion techniques to jointly generate text and images. The model introduces several technical innovations, including a unified masking strategy and classifier-free guidance, which enable it to outperform autoregressive baselines in conditional generation tasks and perform unusual tasks like simultaneous text-image inpainting. While UniDisc requires approximately 13 times more compute during training compared to autoregressive approaches, its ability to perform parallel inference and iteratively refine outputs leads to better generation quality and more efficient inference, particularly when scaling to larger models. (GitHub and arXiv)
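
The core trick, treating text tokens and quantized image tokens as one discrete sequence to be masked and denoised jointly, can be sketched briefly. The snippet below is a simplified illustration in the spirit of UniDisc’s unified masking; the token ids, noise schedule, and shapes are placeholders rather than the paper’s exact formulation.

```python
import torch

MASK_ID = 0  # placeholder id for a shared [MASK] token across modalities

def unified_masking(text_tokens, image_tokens, mask_id=MASK_ID):
    """Jointly corrupt text and (vector-quantized) image tokens with one
    masking schedule, so a single denoiser learns to unmask both."""
    # Treat both modalities as a single discrete sequence.
    seq = torch.cat([text_tokens, image_tokens], dim=-1)  # (batch, T_text + T_img)

    # Sample a per-example masking ratio, playing the role of the
    # diffusion time step.
    t = torch.rand(seq.shape[0], 1)
    mask = torch.rand(seq.shape, dtype=torch.float) < t

    corrupted = seq.masked_fill(mask, mask_id)
    return corrupted, mask  # the model is trained to predict seq at masked positions
```

At sampling time the process runs in reverse: start from a fully masked sequence and iteratively unmask tokens in parallel, which is what enables tricks like joint text-image inpainting.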

Alibaba introduces QVQ-Max visual analysis model

Alibaba released QVQ-Max, a new visual reasoning model that analyzes images and videos while performing tasks like solving mathematical problems and generating code to recreate selected images. The model improves upon the company’s QVQ-72B-Preview from December 2024, adding adjustable levels of “thinking” by generating more reasoning tokens. For example, QVQ-Max can improve its performance on the MathVision multimodal math benchmark from 43.5 percent accuracy to 48.1 percent accuracy by raising its generation limit from 4,000 to 24,000 tokens. Alibaba says that this and future visual reasoning models will not only answer questions about images more accurately but also serve as creative and productivity tools, helping design or edit illustrations, blueprints, and other graphics. (GitHub)
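
Since the reported gains come simply from letting the model generate more reasoning tokens, trying this amounts to raising the generation limit. The sketch below assumes an OpenAI-compatible endpoint; the base URL, the model name, and whether `max_tokens` is the right knob for QVQ-Max are assumptions, not confirmed details.

```python
from openai import OpenAI

# Assumption: QVQ-Max is reachable via an OpenAI-compatible API; the URL
# and model name below are placeholders, not confirmed values.
client = OpenAI(base_url="https://example-endpoint/v1", api_key="YOUR_KEY")

def ask_with_budget(image_url, question, budget):
    """Send the same visual question with a larger or smaller generation
    limit to trade latency for more 'thinking' (reasoning tokens)."""
    resp = client.chat.completions.create(
        model="qvq-max",    # placeholder model name
        max_tokens=budget,  # e.g., 4_000 for quick answers, 24_000 for hard problems
        messages=[{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": image_url}},
                {"type": "text", "text": question},
            ],
        }],
    )
    return resp.choices[0].message.content
```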

Microsoft previews research and analysis tools for Copilot

Microsoft demonstrated two new AI agents called Researcher and Analyst, both designed to help workers analyze company data and web information. Researcher uses OpenAI’s deep research model to conduct complex investigations and create reports, while Analyst specializes in data analysis, using the o3-mini reasoning model to manage data queries with Python. The new tools, which will roll out to Microsoft 365 Copilot customers in April through a “Frontier” program, are part of Microsoft’s push to embed specialized AI capabilities directly into workplace software, drawing on data stored in its cloud. (Microsoft)


Still want to know more about what matters in AI right now?

Read last week’s issue of The Batch for in-depth analysis of news and research.

Last week, Andrew Ng shared his thoughts on when fine-tuning small language models is truly necessary — and when simpler approaches like prompting or agentic workflows may be more effective and easier to maintain.

“Because it adds extra complexity both in training and deployment, usually I resort to this technique only after I find that prompting and simple agentic workflows are not up to a task.”

Read Andrew’s full letter here.

Other top AI news and research stories we covered in depth:

  • Google released Gemma 3, a family of compact vision-language models with open weights, enabling multimodal capabilities on a single GPU
  • Researchers introduced shortcut models that generate high-quality diffusion images in fewer steps, improving speed without sacrificing performance
  • A study showed that GPT-4 can significantly enhance remote tutors’ effectiveness by providing real-time pedagogical support
  • A new technique using pretrained embeddings like DINOv2 helped diffusion transformers learn faster, reducing training time while improving image quality


Subscribe to Data Points

Your accelerated guide to AI news and research