Large Multimodal Models (LMMs)

28 Posts

A Single Tokenizer for Visual Media: Apple’s AToken, a multimodal model with a single encoder and tokenizer for images, videos, and 3D objects

Multimodal models typically use different tokenizers to embed different media types, and different encoders when training to generate media rather than classify it.
DeepSeek Snubs Nvidia for Huawei: DeepSeek made its upcoming 4.0 model available for performance testing to Chinese chipmakers but not U.S. ones

DeepSeek, the Chinese developer of outstanding open-weights models, has withheld an upcoming update of its flagship model from U.S. chipmakers, a move that intensifies the AI rivalry between the U.S. and China.
Qwen3.5 Outperforms Bigger Models, Leads Vision Benchmarks: Alibaba’s latest flagship models are open-weights MoE performers in sizes from less than 1B parameters

The Qwen3.5 family of open-weights vision-language models includes impressive larger models as well as a smaller one that outperforms an OpenAI open-weights model 10 times its size.
GPT-5.4’s Higher Performance, Higher Price: OpenAI’s GPT-5.4 Pro and GPT-5.4 Thinking challenge Google’s Gemini 3.1 Pro Preview as best all-around AI model

OpenAI updated its flagship models, extending their ability to use tools and setting the state of the art on a handful of benchmarks, and priced them at the top of the market. Their coding and agentic abilities have enabled Codex, OpenAI’s competitor to Anthropic’s Claude Code, to leap ahead.
Nano Banana 2 Ups Performance/Price: Gemini 3.1 Flash Image makes photo generation and edits easier and faster

Google launched a cheaper, faster successor to its flagship image generator, delivering greater interactivity at roughly half the price.
Gemini Takes the Lead: Google releases Gemini 3.1 Pro in preview, tops Intelligence Index at same price

Google updated its flagship Gemini model, topping several benchmarks while undercutting competitors on performance per dollar.
Recipe for Smaller, Capable Models: Mistral uses cascade distillation on Mistral 3 to build Ministral family

Mistral compressed Mistral Small 3.1 into much smaller versions, yielding a family of relatively small, open-weights, vision-language models that perform better by some measures than competing models of similar size. The method combines pruning and distillation.
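Distillation of the kind described above is commonly implemented by having the teacher's temperature-softened output distribution supervise a pruned student; in a cascade, each stage's student can serve as the next stage's teacher. The sketch below shows a standard logit-distillation objective in plain Python. It is a generic illustration, not Mistral's actual recipe; the temperature value and the direction of the KL divergence are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax over a list of logits.
    # Higher temperature flattens the distribution, exposing
    # the teacher's "dark knowledge" about non-top classes.
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) on temperature-softened distributions,
    # the classic objective for logit distillation. Zero when the
    # student matches the teacher exactly.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

In a cascade, one would prune the teacher (removing layers or attention heads), minimize this loss to recover quality, then repeat with the result as the new teacher for the next, smaller stage.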
Kimi K2.5 Creates Its Own Workforce: Moonshot AI takes the open model crown with vision updates, aided by subagents

An open-source vision-language model unleashes minion agents that enable it to perform tasks more quickly and effectively.
Refining Words in Pictures: Z.ai’s GLM-Image blends transformer and diffusion architectures for better text in images

Image generators often mangle text. An open-weights model outperforms open and proprietary competitors in text rendering.
Multimodal Models for Biomedicine by Pengtao Xie: Pengtao Xie of UC San Diego on why medical models need to visualize tiny chemicals and large organs

Over the past few years, we have seen rapid progress in models that jointly reason over text, images, sequences, graphs, and time series. Yet in biomedical settings, these capabilities often remain fragmented, brittle, or difficult to interpret.
From Prediction to Action by Tanmay Gupta: Tanmay Gupta of the Allen Institute on building AI for long-horizon tasks

AI research in 2026 should confront a simple but transformative realization: Models that predict are not the same as systems that act. The latter is what we actually need.
Adapting LLMs to Any Sort of Data: SEMI (Sample-Efficient Modality Integration) tackles new domains with few-shot examples

Enabling a pretrained large language model to process a data type other than text (say, images), possibly in a specialized domain (say, radiology), typically requires thousands to millions of examples that pair the other data (perhaps x-rays) with text.
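A common way to graft a new modality onto a frozen LLM is a small projection adapter that maps encoder features into the model's token-embedding space, trained only on the paired examples. SEMI's contribution is making such integration work from few-shot data; its exact adapter design isn't detailed here, so the sketch below is a generic linear adapter with all names and dimensions hypothetical.

```python
import random

def init_adapter(feat_dim, embed_dim, seed=0):
    # Randomly initialized linear adapter: a weight matrix that maps
    # an encoder feature vector (feat_dim) to the LLM's token-embedding
    # size (embed_dim). In training, only these weights are updated;
    # the LLM itself stays frozen.
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(feat_dim)]
            for _ in range(embed_dim)]

def project(features, adapter):
    # Produce one "soft token" for the LLM: a linear map of the
    # encoder's output into the embedding space.
    return [sum(f * w for f, w in zip(features, row)) for row in adapter]
```

In practice, the projected vector is prepended to the text-token embeddings so the frozen LLM can attend to it, and the few available paired examples (e.g., x-ray features with report text) supply the training signal for the adapter alone.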
Disney Teams Up With OpenAI: OpenAI’s Sora video generator will include Disney characters, with fan videos on Disney+

Disney, the entertainment conglomerate that owns Marvel, Pixar, Lucasfilm and its own animated classics from 101 Dalmatians to Zootopia, licensed OpenAI to use its characters in generated videos.
Coherent, Interactive Worlds: Runway’s GWM-1 models generate videos with consistent physics for robots and entertainment

Runway’s GWM-1 family of video-generation models responds to user input in real time while producing scenes that remain consistent regardless of the camera’s position.
Amazon Steps Forward: Nova 2 family boosts cost-effective performance, adds new agentic features

Amazon raised the competitive profile of its foundation models and added services for custom model training and an agent platform for browser automation.