Alibaba’s latest open-weights system raises the bar for multimodal performance in a relatively small model.
What’s new: Alibaba released Qwen2.5-Omni 7B, a single model that accepts text, image, audio, and video input and responds with text or speech.
- Input/output: Input: text, images (up to 10 MB per file), audio (up to 10 MB and 3 minutes per file), and video (up to 150 MB and 40 seconds per file), up to 32,768 tokens in total. Output: text and speech
- Performance: State of the art in some audio- and image-to-text benchmarks
- Training data: 18 trillion tokens of text (identical to Qwen2.5), 800 billion tokens of images and videos, 300 billion tokens of audio, 100 billion tokens of video with audio
- Undisclosed: Knowledge cutoff, output size, adapter architecture
- Availability: Weights free to download under the Apache 2.0 license.
- API price: Input: 0.4 yuan per million text tokens, 25 yuan per million audio tokens, and 1.5 yuan per million image/video tokens. Output: 1.6 yuan per million text tokens with text-only input, 4.5 yuan per million text tokens with audio, video, or image input, and 50 yuan per million audio tokens with any input.
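To make the pricing concrete, here is a rough cost calculation in Python for a hypothetical request. The per-million-token rates come from the list above; the token counts are invented for illustration.

```python
# Rough cost estimate for one hypothetical Qwen2.5-Omni API call.
# Rates (yuan per million tokens) come from the pricing above;
# the token counts below are illustrative, not measured.

INPUT_RATES = {"text": 0.4, "audio": 25.0, "image_video": 1.5}
OUTPUT_RATES = {
    "text_with_text_only_input": 1.6,
    "text_with_multimodal_input": 4.5,
    "audio": 50.0,
}

def cost(tokens: int, rate_per_million: float) -> float:
    """Convert a token count and a per-million-token rate into yuan."""
    return tokens / 1_000_000 * rate_per_million

# Example: a spoken question (audio in) answered with text and speech.
input_cost = cost(2_000, INPUT_RATES["audio"]) + cost(200, INPUT_RATES["text"])
output_cost = (cost(500, OUTPUT_RATES["text_with_multimodal_input"])
               + cost(1_500, OUTPUT_RATES["audio"]))

print(f"Estimated cost: {input_cost + output_cost:.4f} yuan")  # ~0.1273 yuan
```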
How it works: Qwen2.5-Omni 7B comprises a pretrained text transformer (Qwen2.5-7B), a pretrained vision encoder (from Qwen2.5-VL), a pretrained audio encoder (from Whisper-large-v3), a speech transformer, and an audio decoder (a transformer plus BigVGAN), along with corresponding adapters of undisclosed architecture.
- The team pretrained the system in three stages. First, they pretrained the vision and audio encoders and their adapters with the frozen text transformer to generate the next text token in audio-text and image-text data. In the second stage, they pretrained the entire system to generate the next text or audio token in 1.2 trillion tokens of multimodal data. In the last stage, they pretrained the system on longer multimodal inputs.
- They fine-tuned the text transformer to generate the next token in a dataset of multimodal instruction-following tasks.
- They fine-tuned the speech transformer in three stages. First, they fine-tuned the model to generate the next speech token in multimodal dialogues. Then they fine-tuned it via Direct Preference Optimization to prefer speech with fewer erroneous words and unnecessary pauses (a generic version of that loss is sketched after this list). Finally, they fine-tuned it to reproduce the sounds of a few particular human voices.
- At inference, given text, images, audio, and/or video, the vision encoder embeds video frames and images, and the audio encoder embeds audio (including video soundtracks). The adapters transform the embedded frames, images, and audio for further processing. From the text and the embedded frames and audio, the text transformer generates the next text token plus high-level embeddings of the input text, images, video, and audio. From the generated text and the high-level embeddings, the speech transformer generates speech tokens. Finally, the audio decoder turns the speech tokens into audio. (This flow is sketched in code below.)
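The following is a minimal sketch of that inference flow. Every component is a hypothetical stand-in callable that mirrors the description above, not Alibaba’s actual code.

```python
# Minimal sketch of the inference flow described above. Every component here
# is a hypothetical stand-in callable, not Alibaba's implementation.
from typing import Callable, Sequence, Tuple

def run_omni_inference(
    text: str,
    images: Sequence,            # still images
    audio_clips: Sequence,       # audio inputs, including video soundtracks
    video_frames: Sequence,      # sampled video frames
    vision_encoder: Callable,    # e.g., Qwen2.5-VL vision encoder + adapter
    audio_encoder: Callable,     # e.g., Whisper-large-v3 encoder + adapter
    text_transformer: Callable,  # Qwen2.5-7B backbone
    speech_transformer: Callable,
    audio_decoder: Callable,     # transformer + BigVGAN vocoder
) -> Tuple[str, list]:
    # 1. Embed video frames and images; adapters map them into the
    #    text transformer's input space.
    visual_emb = [vision_encoder(x) for x in list(images) + list(video_frames)]
    # 2. Embed audio, including any video soundtrack.
    audio_emb = [audio_encoder(a) for a in audio_clips]
    # 3. The text transformer generates text tokens plus high-level
    #    embeddings of the input text, images, video, and audio.
    text_out, hidden_states = text_transformer(text, visual_emb, audio_emb)
    # 4. The speech transformer turns the generated text and high-level
    #    embeddings into discrete speech tokens.
    speech_tokens = speech_transformer(text_out, hidden_states)
    # 5. The audio decoder renders the speech tokens as a waveform.
    waveform = audio_decoder(speech_tokens)
    return text_out, waveform
```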
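The Direct Preference Optimization step in the speech transformer’s fine-tuning uses a preference loss like the one below, shown here in generic PyTorch form; the variable names and beta value are assumptions, not details Alibaba disclosed.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_logp_chosen: torch.Tensor,    # log-prob of preferred speech (fewer errors/pauses)
    policy_logp_rejected: torch.Tensor,  # log-prob of dispreferred speech
    ref_logp_chosen: torch.Tensor,       # same sequences scored by a frozen reference model
    ref_logp_rejected: torch.Tensor,
    beta: float = 0.1,                   # common default, not reported by Alibaba
) -> torch.Tensor:
    # DPO increases the policy's log-probability margin of preferred over
    # dispreferred outputs, measured relative to the reference model.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```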
Results: The authors compared Qwen2.5-Omni 7B to similarly sized models. It performed especially well on audio-to-text, image-to-text, and video-to-text tasks. However, it performed less well on text-to-text and text-to-speech tasks.
- Qwen2.5-Omni 7B achieved state-of-the-art results on most of the audio-to-text benchmarks tested. For example, when transcribing recorded English speech in Common Voice 15, Qwen2.5-Omni 7B (7.6 percent word error rate) beat the next-best model, MinMo (7.9 percent word error rate). (A minimal word-error-rate implementation follows this list.)
- Qwen2.5-Omni 7B achieved state-of-the-art performance on some image-to-text tasks including MMStar, where it tied with MiniCPM-V (64 percent accuracy) and beat GPT-4o-mini (54.8 percent accuracy).
- On 10 text-to-text benchmarks, Qwen2.5-Omni 7B underperformed Qwen2.5-7B but was generally comparable to Qwen2-7B, Llama 3.1-8B, and Gemma2-9B.
- On the English subset of Seed, in which the system renders text in a particular speaker’s voice based on a snippet of reference audio, Qwen2.5-Omni 7B (2.33 percent word error rate) underperformed F5-TTS (1.83 percent word error rate).
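For readers who haven’t computed it, word error rate is the word-level edit distance between a system’s transcript and a reference, normalized by the reference length. A minimal sketch (the example sentences are made up):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in the reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # Standard Levenshtein distance over words via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# Made-up example: one substitution in a 13-word reference is roughly 7.7 percent WER.
ref = "the quick brown fox jumps over the lazy dog near the old barn"
hyp = "the quick brown fox jumps over a lazy dog near the old barn"
print(f"{word_error_rate(ref, hyp):.1%}")
```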
Behind the news: Multimodal systems with open weights are multiplying. For instance, AnyGPT (open weights, training, and inference code) accepts and generates speech, text, images, and music. Similarly, Mini-Omni2 (open weights and inference code) accepts and generates text, speech, and images.
Why it matters: Multimodal models typically show steep degradation on instruction-following benchmarks when input shifts from text to voice, but Qwen2.5-Omni does not. As the world moves toward voice-to-voice interactions, open systems that deliver performance comparable to that of closed competitors accelerate progress toward better conversations.
We’re thinking: The Qwen team is on fire! Alibaba’s steady stream of highly capable open-weights models is a gift to AI developers.