Large Multimodal Models (LMMs)

9 Posts

Image: AI music generation interface showing a waveform and text prompts such as deep house, djembe, and saxophone.

Music Generation for Pros: Google upgrades its AI music tools for professional use

Google refreshed its experimental tools for composers and producers.
Image: Animation of GPT Image 1 generating AI images: emotions, surreal scenes, satire, fantasy, and photo-realistic edits.

New Image Generator for OpenAI API: OpenAI launches API access to GPT Image 1, ChatGPT’s viral image generator

ChatGPT’s image generator is available via API.
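For developers, here is a minimal sketch of calling the model through OpenAI's Python SDK; the prompt, size, and output filename are illustrative, not taken from the article.

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.generate(
    model="gpt-image-1",
    prompt="A watercolor lighthouse at dawn, soft light",  # illustrative prompt
    size="1024x1024",
)

# The API returns the image as base64-encoded data; decode and save it.
image_bytes = base64.b64decode(result.data[0].b64_json)
with open("lighthouse.png", "wb") as f:
    f.write(image_bytes)
```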
Image: Diagram of a generator-and-scorer loop producing a final output for a test image of a cat.

Text-Only LLM Goes Multimodal: LLMs learn to caption images, video, and audio without further training

Large language models excel at processing text but can’t interpret images, video, or audio directly without further training on those media types. Researchers devised a way to overcome this limitation.
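The accompanying diagram points to a generate-and-score loop. Below is a minimal sketch of that idea, assuming two hypothetical callables: generate_candidates (a text-only LLM that proposes captions from text hints alone) and score (a pretrained image-text scorer such as CLIP). It illustrates the loop, not the researchers' exact method.

```python
# Minimal sketch of a generator-scorer loop for training-free captioning.
# `generate_candidates` and `score` are hypothetical callables standing in
# for a text-only LLM and a pretrained image-text scorer (e.g., CLIP).

def caption_by_search(image, generate_candidates, score,
                      rounds=5, num_candidates=32, keep_top=8):
    """Iteratively propose captions and re-rank them against the image."""
    best = []  # (score, caption) pairs, highest score first
    for _ in range(rounds):
        # The LLM never sees the image; it only sees the best captions so far.
        hints = [caption for _, caption in best]
        candidates = generate_candidates(hints, num_candidates)

        # The scorer grounds the search in the image itself.
        scored = [(score(image, c), c) for c in candidates]
        best = sorted(best + scored, key=lambda p: p[0], reverse=True)[:keep_top]

    return best[0][1]  # highest-scoring caption after the final round
```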
Image: Architecture of Qwen2.5-Omni showing multimodal processing with vision and audio encoders, thinker, talker, and decoder.

Better Multimodal Performance With Open Weights: Qwen2.5-Omni 7B raises the bar for small multimodal models

Alibaba’s latest open-weights system raises the bar for multimodal tasks in a relatively small model.
Image: Llama 4 Behemoth benchmark chart comparing coding, reasoning, and multilingual scores against Claude, Gemini, and GPT-4.5.

Llama’s Mixture of Vision-Language Experts: Meta releases Llama 4 models, claims edge over AI competitors

Meta updated its popular open-weights models, claiming performance superior to closed competitors in three size classes.
Image: Mochi-style illustrated characters with diverse facial expressions, used to visualize AI emotion recognition.

Interactive Voice-to-Voice With Vision: MoshiVis adds image understanding to voice-first conversations

Researchers updated the highly responsive Moshi voice-to-voice model to discuss visual input.
Image: Comparison table of Gemini and Gemma models across benchmarks such as MMLU, MATH, and GPQA, with radar charts.

Vision-Language, Compact and Open: Google releases Gemma 3 vision-language models with open weights

Google updated its open-weights family of large language models to include versions that handle image and video inputs.
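As a rough illustration of running such open weights locally, the sketch below loads an instruction-tuned checkpoint through Hugging Face Transformers; the checkpoint name, pipeline task, and message format are assumptions based on common Transformers usage, not instructions from the article.

```python
from transformers import pipeline

# Assumes a recent transformers release with Gemma 3 support and access to the checkpoint.
pipe = pipeline("image-text-to-text", model="google/gemma-3-4b-it")

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "https://example.com/photo.jpg"},  # illustrative image URL
        {"type": "text", "text": "Describe this image in one sentence."},
    ],
}]

output = pipe(text=messages, max_new_tokens=60)
print(output[0]["generated_text"])  # includes the model's reply
```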
Image: Aya Vision architecture diagram showing a vision encoder, multimodal merging, and an LLM backbone for image processing.

Equally Fluent in Many Languages: Cohere’s Aya Vision beats multilingual rivals in text & image understanding

Multilingual AI models often suffer uneven performance across languages, especially in multimodal tasks. A pair of lean models counters this trend with consistent understanding of text and images across major languages.
Image: Phi-4 Mini multimodal architecture integrating vision, audio, and text via token merging and LoRA-adapted weights.

Microsoft Tackles Voice-In, Text-Out: Microsoft’s Phi-4 Multimodal model can process text, images, and speech simultaneously

Microsoft debuted its first official large language model that responds to spoken input.
