Large Multimodal Models (LMMs)

19 Posts

Pengtao Xie is pictured standing near a chalkboard filled with mathematical notes, addressing a classroom of attentive students.

Multimodal Models for Biomedicine by Pengtao Xie: Pengtao Xie of UC-San Diego on why medical models need to visualize tiny chemicals and large organs

Over the past few years, we have seen rapid progress in models that jointly reason over text, images, sequences, graphs, and time series. Yet in biomedical settings, these capabilities often remain fragmented, brittle, or difficult to interpret.
Tanmay Gupta is pictured smiling next to a whiteboard filled with mathematical formulas, embodying active AI engagement.

From Prediction to Action by Tanmay Gupta: Tanmay Gupta of the Allen Institute on building AI for long-horizon tasks

AI research in 2026 should confront a simple but transformative realization: Models that predict are not the same as systems that act. The latter is what we actually need.
Diagram shows LLM training with encoders for images, audio, video; inference with galaxies, satellites.

Adapting LLMs to Any Sort of Data: SEMI (Sample-Efficient Modality Integration) tackles new domains with few-shot examples

Enabling a pretrained large language model to process a data type other than text (say, images), possibly in a specialized domain (say, radiology), typically requires thousands to millions of examples that pair the other data (perhaps x-rays) with text.
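The core idea behind few-shot modality integration can be illustrated with a small sketch: keep the modality encoder and the LLM frozen, and fit only a lightweight projector that maps encoder features into the LLM's embedding space from a handful of paired examples. All names, dimensions, and the closed-form fit below are illustrative assumptions, not SEMI's actual architecture or training procedure.

```python
import numpy as np

# Hypothetical sizes: a frozen image encoder emits 512-d features;
# the frozen LLM consumes 4096-d token embeddings. Only 32 paired
# examples are available, standing in for the few-shot setting.
ENC_DIM, LLM_DIM, N_SHOTS = 512, 4096, 32

rng = np.random.default_rng(0)
X = rng.normal(size=(N_SHOTS, ENC_DIM))  # frozen-encoder outputs
Y = rng.normal(size=(N_SHOTS, LLM_DIM))  # target embeddings in LLM space

# The only trainable piece is a linear projector W. Here a
# ridge-regularized least-squares fit stands in for gradient training
# on the few paired examples.
lam = 1e-2
W = np.linalg.solve(X.T @ X + lam * np.eye(ENC_DIM), X.T @ Y)

def project(features: np.ndarray) -> np.ndarray:
    """Map frozen-encoder features into the LLM's embedding space."""
    return features @ W

# The projected "soft tokens" can be prepended to a text prompt's
# token embeddings and fed to the unchanged LLM.
soft_tokens = project(X[:1])
print(soft_tokens.shape)  # (1, 4096)
```

Because both large networks stay frozen, the number of trained parameters (here, one 512×4096 matrix) is small enough that a few dozen paired examples can plausibly constrain it.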
A superhero in blue and red kneels in front of cityscape, holding a shield with the OpenAI logo.

Disney Teams Up With OpenAI: OpenAI’s Sora video generator will include Disney characters, with fan videos on Disney+

Disney, the entertainment conglomerate that owns Marvel, Pixar, Lucasfilm and its own animated classics from 101 Dalmatians to Zootopia, licensed OpenAI to use its characters in generated videos.
GIF showing a robotic arm picking up glasses on a table and handling tools on a kitchen countertop.

Coherent, Interactive Worlds: Runway’s GWM-1 models generate videos with consistent physics for robots and entertainment

Runway’s GWM-1 family of video-generation models responds to user input in real time while producing scenes that remain consistent regardless of the camera’s position.
Table comparing Nova 2 Pro to other models in reasoning, coding, perception, and workflows.

Amazon Steps Forward: Nova 2 family boosts cost-effective performance, adds new agentic features

Amazon raised the competitive profile of its foundation models and added services for custom model training and an agent platform for browser automation.
Graph shows Ernie-4.5 outperforming competitors in document understanding and visual reasoning tasks.

Baidu’s Multimodal Bids: Giant Ernie 5 natively generates multiple media; Ernie-4.5-VL-28B-A3B-Thinking tops vision-language benchmarks

Baidu debuted two models: a lightweight, open-weights, vision-language model and a giant, proprietary, multimodal model built to take on U.S. competitors.
Table shows Gemini 3 Pro leading in benchmarks, outperforming Gemini 2.5, Claude Sonnet 4.5, and GPT-5.1.

Google Dominates Arena Leaderboards (For the Moment): Gemini 3 Pro and Nano Banana Pro boast best-in-class multimodal reasoning and image generation

Google introduced Gemini 3 Pro and Nano Banana Pro, its flagship vision-language and image-generation models, and deployed them to billions of users worldwide.
Bar chart comparing performance of Qwen3 models against others in diverse tasks, highlighting Qwen3-Max.

Qwen3 Goes Big (and Smaller): Alibaba expands Qwen3 family with a 1 trillion-parameter Max model, open-weights Qwen3-VL, and the Qwen3-Omni voice model

Alibaba rounded out the Qwen3 family with its biggest large language model to date as well as smaller models that process text, images, video, and/or audio.
Side-by-side of a fern leaf and its digital code representation, illustrating nature's pattern-to-code transformation.

Google I/O Overdrive: Google’s new AI offerings include Veo 3 video generator, lightweight Gemma 3n, updates to Gemini Pro and Ultra, and more

Google revamped its roster of models, closed and open, and added more AI-powered features to its existing products.
AI music generation interface showing waveform and text prompts like deep house, djembe, and saxophone.

Music Generation for Pros: Google upgrades its AI music tools for professional use

Google refreshed its experimental tools for composers and producers.
Animation showing GPT Image 1 generating AI images: emotions, surreal scenes, satire, fantasy, and photo-realistic edits.

New Image Generator for OpenAI API: OpenAI launches API access to GPT Image 1, ChatGPT’s viral image generator

ChatGPT’s image generator is available via API.
AI diagram showing generator and scorer loop to produce final output based on test image of a cat.

Text-Only LLM Goes Multimodal: LLMs learn to caption images, video, and audio without further training

Large language models excel at processing text but can’t interpret images, video, or audio directly without further training on those media types. Researchers devised a way to overcome this limitation.
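The diagram's generator–scorer loop can be sketched in a few lines: a text-only generator proposes candidate captions, a pretrained multimodal scorer (e.g., a CLIP-style image–text matcher) rates each candidate against the input, and the best candidates seed the next round. Every function below is a toy stand-in for the real components, not the researchers' actual implementation.

```python
import random

def generate_candidates(seeds, n=8):
    """Stand-in for a text-only LLM that rewrites/extends seed captions."""
    return [s + random.choice([" cat", " sleeping", " on a mat"])
            for s in seeds for _ in range(n // len(seeds))]

def score(caption, image):
    """Stand-in for a multimodal matcher; higher = better fit to the image."""
    return sum(word in image["keywords"] for word in caption.split())

def caption_image(image, rounds=3, keep=2):
    candidates = ["a photo of a"]
    for _ in range(rounds):
        candidates = generate_candidates(candidates)
        candidates.sort(key=lambda c: score(c, image), reverse=True)
        candidates = candidates[:keep]  # best captions seed the next round
    return candidates[0]

# Toy "image" represented by the concepts a real scorer would detect.
image = {"keywords": {"cat", "sleeping", "mat"}}
print(caption_image(image))
```

The key property this illustrates: neither component is trained. The generator never sees the image, and the scorer never generates text; iterating the loop steers the text-only model toward captions the multimodal scorer prefers.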
Architecture of Qwen2.5-Omni showing multimodal processing with vision and audio encoders, thinker, talker, and decoder.

Better Multimodal Performance With Open Weights: Qwen2.5-Omni 7B raises the bar for small multimodal models

Alibaba’s latest open-weights system raises the bar for multimodal tasks in a relatively small model.
Llama 4 Behemoth benchmark chart comparing coding, reasoning, and multilingual scores with Claude, Gemini, and GPT-4.5.

Llama’s Mixture of Vision-Language Experts: Meta releases Llama 4 models, claims edge over AI competitors

Meta updated its popular open-weights models, claiming performance superior to closed competitors in three size classes.