Text-Only LLM Goes Multimodal
LLMs learn to caption images, video, and audio without further training

Diagram of the generator-and-scorer loop that produces a final caption for a test image of a cat.

Large language models excel at processing text but can’t interpret images, video, or audio directly without further training on those media types. Researchers devised a way to overcome this limitation.

What’s new: Kumar Ashutosh and colleagues at Meta, the University of Texas at Austin, and UC Berkeley introduced Multimodal Iterative LLM Solver (MILS), a method that pairs a text-only large language model (LLM) with a multimodal embedding model to generate captions for images, video, and audio without further training.

Key insight: LLMs can generate text and refine their outputs based on new information. Multimodal embedding models, meanwhile, can score the similarity between a given text and an image, video, or audio clip. Given such a score, an LLM can regenerate the text iteratively until the score indicates a strong match between the text and the associated media. This enables the LLM to generate accurate captions for images, videos, and audio clips without being trained on these tasks.
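
As a concrete illustration of the scoring step, here is a minimal sketch that scores candidate captions against an image with SigLIP via Hugging Face Transformers. The checkpoint name, image file, and candidate captions are assumptions for illustration, not details from the paper.

```python
# Minimal sketch: score candidate captions against an image with SigLIP.
# The checkpoint, image path, and captions are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

ckpt = "google/siglip-base-patch16-224"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

image = Image.open("cat.jpg")
captions = ["a cat lounging on a couch", "a dog catching a frisbee"]

inputs = processor(text=captions, images=image,
                   padding="max_length", return_tensors="pt")
with torch.no_grad():
    scores = model(**inputs).logits_per_image  # shape (1, num_captions)

# A higher score indicates a stronger text-image match; an LLM-in-the-loop
# system would keep the best-scoring captions and try to improve on them.
print(captions[int(scores.argmax())])
```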

How it works: Given a prompt and an image, video, or audio clip, Llama 3.1 8B produced captions and iteratively refined them according to a pretrained multimodal embedding model’s estimate of the similarity between each caption and the media.

  • The LLM generated 30,000 to 50,000 initial captions to prime the process.
  • Given each caption and a media file, a multimodal embedding model estimated their semantic similarity: SigLIP scored text against images, ViCLIP against video, and ImageBind against audio.
  • Based on the top 50 most-similar previous captions, the LLM generated new captions.
  • The system repeated the previous two steps until the top-scoring captions changed little or it reached a predetermined number of iterations (this loop is sketched in code below).
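
The sketch below follows the steps above in plain Python. The generate_captions and similarity functions are placeholders standing in for the LLM (Llama 3.1 8B) and the multimodal scorer, and the stopping tolerance is an assumption; this is not the authors’ code.

```python
# Sketch of the generate-score-refine loop described above (not the authors' code).
# generate_captions and similarity are placeholders for the LLM and the
# multimodal embedding model (SigLIP / ViCLIP / ImageBind); tol is assumed.
def mils_caption(media, generate_captions, similarity,
                 n_init=30_000, top_k=50, max_iters=20, tol=1e-4):
    # Prime the process with a large pool of initial captions from the LLM.
    captions = generate_captions("Describe this media.", n=n_init)
    best_score = float("-inf")
    top = captions[:top_k]

    for _ in range(max_iters):
        # Score every candidate caption against the media and keep the top 50.
        scored = sorted(((similarity(c, media), c) for c in captions), reverse=True)
        top = [caption for _, caption in scored[:top_k]]

        # Stop when the best score barely improves.
        if scored[0][0] - best_score < tol:
            break
        best_score = scored[0][0]

        # Ask the LLM for new captions conditioned on the current best ones.
        captions = generate_captions(
            "Write improved captions based on these candidates:\n" + "\n".join(top),
            n=top_k,
        )

    return top[0]
```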

Results: The authors evaluated MILS on captioning images, videos, and audio clips. They measured performance using Metric for Evaluation of Translation with Explicit ORdering (METEOR), which checks for synonyms, shared word roots, and word order to determine how well a generated caption matches a ground-truth caption (higher is better; a snippet for computing METEOR follows the results below). Overall, MILS outperformed models that underwent task-specific training.

  • On the MSCOCO dataset for image captioning, MILS achieved 15.0 METEOR, while MeaCap achieved 14.1 METEOR.
  • On MSR-VTT, which evaluates video captioning, MILS attained 14.4 METEOR, while a model trained to caption videos achieved 11.3 METEOR.
  • On Clotho, which assesses audio captioning, MILS achieved 12.4 METEOR, while ZerAuCap reached 9.4 METEOR.
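
For reference, METEOR can be computed with NLTK. This is a minimal sketch, assuming nltk and its WordNet data are installed; the captions are invented for illustration.

```python
# Minimal METEOR computation with NLTK (captions are made-up examples).
# Requires: pip install nltk, then nltk.download("wordnet")
from nltk.translate.meteor_score import meteor_score

reference = "a cat is sitting on a wooden table".split()
hypothesis = "a cat sits on a table".split()

# meteor_score expects tokenized inputs; it rewards synonym and stem matches
# and penalizes word-order differences. Scores range from 0 to 1 (higher is better).
print(meteor_score([reference], hypothesis))
```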

Why it matters: Models that caption media zero-shot, like Aya Vision and Pixtral, still require training on paired captions and media. The authors’ approach instead takes advantage of pretrained multimodal embedding models to enable an LLM to compose multimedia captions without further training.

We’re thinking: Synthetic data is increasingly useful for training AI models. By enabling LLMs to synthesize good captions, MILS adds fuel to this fire.
