Large Multimodal Models (LMMs)


ByteDance Bids for Video Leadership: ByteDance adds its state-of-the-art Seedance 2.0 video model to CapCut, while OpenAI retreats

As OpenAI prepares to shut down Sora, ByteDance has made its own video generation model available to hundreds of millions of users.

(Image: benchmark table showing Kimi K2.6 leading in agentic tasks at 86.3 and coding at 58.6.)

Kimi K2.6 Challenges Open-Weights Champs: Kimi K2.6 matches open Qwen3.6 Max and DeepSeek V4, falls just behind top closed models

Moonshot AI’s updated Kimi model handles longer autonomous coding sessions and scales up its multi-agent orchestration relative to its predecessor.

(Image: benchmark table showing GPT-5.5 leading Terminal-Bench 2.0 with a score of 82.7 percent.)

GPT-5.5 Outperforms, Hallucinates: OpenAI’s latest model tops leaderboards for coding, visual puzzles, and overall intelligence

The latest update of OpenAI’s flagship model sets new states of the art in important benchmarks but has difficulty distinguishing between what it does and doesn't know.

Life After Llama: With Muse Spark, Meta pivots away from its open-weights Llama strategy

Meta pivoted from its open-weights strategy to deliver a closed alternative.

A Single Tokenizer for Visual Media: Apple’s AToken, a multimodal model with a single encoder and tokenizer for images, videos, and 3D objects

Multimodal models typically use different tokenizers to embed different media types, and different encoders when training to generate media rather than classify it.
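
That design suggests what a unified alternative looks like. The sketch below is a hypothetical illustration in PyTorch, not Apple’s actual AToken code: all names are invented, and it assumes a patch-based design in which every modality is first flattened into patches so that one shared projection can tokenize them all.

    import torch
    import torch.nn as nn

    class UnifiedVisualTokenizer(nn.Module):
        """One tokenizer for images, videos, and 3D objects (hypothetical).

        Each modality is first expressed as a set of flat patches: images as
        2D patches, videos as spatiotemporal patches, 3D objects as surface
        or voxel patches. A single shared projection then embeds all of them
        into one token space.
        """
        def __init__(self, patch_dim: int = 768, token_dim: int = 1024):
            super().__init__()
            self.project = nn.Linear(patch_dim, token_dim)  # shared across modalities

        def forward(self, patches: torch.Tensor) -> torch.Tensor:
            # patches: (batch, num_patches, patch_dim), whatever their source
            return self.project(patches)

    def patchify_image(image: torch.Tensor, patch: int = 16) -> torch.Tensor:
        # (batch, C, H, W) -> (batch, num_patches, C * patch * patch)
        b, c, h, w = image.shape
        x = image.unfold(2, patch, patch).unfold(3, patch, patch)  # (b, c, nh, nw, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).contiguous()               # (b, nh, nw, c, p, p)
        return x.view(b, -1, c * patch * patch)

    tokenizer = UnifiedVisualTokenizer()
    tokens = tokenizer(patchify_image(torch.randn(2, 3, 224, 224)))  # shape (2, 196, 1024)

Under this sketch, a video would be cut into spatiotemporal patches and a 3D object into surface patches, then fed through the same projection; no per-modality encoder is needed.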

DeepSeek Snubs Nvidia for Huawei: DeepSeek made its upcoming 4.0 model available for performance testing to Chinese chipmakers but not U.S. ones

DeepSeek, the Chinese developer of outstanding open-weights models, has withheld an upcoming update of its flagship model from U.S. chipmakers, a move that intensifies the AI rivalry between the U.S. and China.

Qwen3.5 Outperforms Bigger Models, Leads Vision Benchmarks: Alibaba’s latest flagship models are open-weights MoE performers in sizes starting at less than 1B parameters

The Qwen3.5 family of open-weights vision-language models includes impressive larger models as well as a smaller one that outperforms an OpenAI open-weights model 10 times its size.

(Image: benchmark table showing GPT-5.4 setting new state-of-the-art scores on GDPval and Tau2-bench Telecom.)

GPT-5.4’s Higher Performance, Higher Price: OpenAI’s GPT-5.4 Pro and GPT-5.4 Thinking challenge Google’s Gemini 3.1 Pro Preview as best all-around AI model

OpenAI updated its flagship models, extending their ability to use tools and setting the state of the art on a handful of benchmarks, and priced them at the top of the market. Their coding and agentic abilities have enabled Codex, OpenAI’s competitor to Anthropic’s Claude Code, to leap ahead.

Nano Banana 2 Ups Performance/Price: Gemini 3.1 Flash Image makes photo generation and editing easier and faster

Google launched a cheaper, faster successor to its flagship image generator, delivering greater interactivity at roughly half the price.

Gemini Takes the Lead: Google releases Gemini 3.1 Pro in preview, tops the Intelligence Index at the same price

Google updated its flagship Gemini model, topping several benchmarks while undercutting competitors on performance per dollar.

Recipe for Smaller, Capable Models: Mistral uses cascade distillation on Mistral 3 to build the Ministral family

Mistral compressed Mistral Small 3.1 into much smaller versions, yielding a family of relatively small, open-weights, vision-language models that perform better by some measures than competing models of similar size. The method combines pruning and distillation.
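
The summary names the ingredients without the details, so the sketch below shows only the generic pattern, assuming standard magnitude pruning and a Hinton-style distillation loss in PyTorch. It is illustrative, not Mistral’s actual recipe.

    import torch
    import torch.nn.functional as F
    import torch.nn.utils.prune as prune

    def prune_linear_layers(model: torch.nn.Module, amount: float = 0.3) -> None:
        # Magnitude-prune a fraction of the weights in every linear layer.
        for module in model.modules():
            if isinstance(module, torch.nn.Linear):
                prune.l1_unstructured(module, name="weight", amount=amount)

    def distillation_loss(student_logits: torch.Tensor,
                          teacher_logits: torch.Tensor,
                          temperature: float = 2.0) -> torch.Tensor:
        # KL divergence between softened teacher and student distributions,
        # scaled by T^2 to keep gradients comparable across temperatures.
        t = temperature
        return F.kl_div(
            F.log_softmax(student_logits / t, dim=-1),
            F.softmax(teacher_logits / t, dim=-1),
            reduction="batchmean",
        ) * (t * t)

One plausible reading of the cascade is that each pruned-and-distilled student then serves as the teacher for the next, smaller stage.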

Kimi K2.5 Creates Its Own Workforce: Moonshot AI takes the open model crown with vision updates, aided by subagents

An open-source vision-language model unleashes minion agents that enable it to perform tasks more quickly and effectively.
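
The orchestration pattern itself is simple to sketch. The toy below uses Python’s asyncio with invented names and a stubbed model call; it is not Moonshot AI’s implementation, only an illustration of why fanning subtasks out to concurrent subagents is faster than working through them serially.

    import asyncio

    async def subagent(subtask: str) -> str:
        # Stand-in for a model call that handles one narrow subtask.
        await asyncio.sleep(0.1)  # simulate model-call latency
        return f"result for {subtask!r}"

    async def orchestrate(task: str, subtasks: list[str]) -> str:
        # Fan out: all subagents run concurrently, so total latency is
        # roughly one model call rather than one per subtask.
        results = await asyncio.gather(*(subagent(s) for s in subtasks))
        return f"{task}: " + "; ".join(results)

    print(asyncio.run(orchestrate("summarize repo",
                                  ["read docs", "scan code", "list issues"])))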

Refining Words in Pictures: Z.ai’s GLM-Image blends transformer and diffusion architectures for better text in images

Image generators often mangle text. An open-weights model outperforms open and proprietary competitors in text rendering.

Multimodal Models for Biomedicine by Pengtao Xie: Pengtao Xie of UC San Diego on why medical models need to visualize tiny chemicals and large organs

Over the past few years, we have seen rapid progress in models that jointly reason over text, images, sequences, graphs, and time series. Yet in biomedical settings, these capabilities often remain fragmented, brittle, or difficult to interpret.

From Prediction to Action by Tanmay Gupta: Tanmay Gupta of the Allen Institute on building AI for long-horizon tasks

AI research in 2026 should confront a simple but transformative realization: Models that predict are not the same as systems that act. The latter is what we actually need.