Faster, Cheaper Multimodality All about GPT-4o, OpenAI’s latest multimodal model

Published

May 22, 2024

Reading time

3 min read

OpenAI’s latest model raises the bar for models that can work with common media types in any combination.

What’s new: OpenAI introduced GPT-4o, a model that accepts and generates text, images, audio, and video — the “o” is for omni — more quickly, inexpensively, and in some cases more accurately than its predecessors. Text and image input and text-only output are available currently via ChatGPT and API, with image output coming soon. Speech input and output will roll out to paying users in coming weeks. General audio and video will be available first to partners before rolling out more broadly.

How it works: GPT-4o is a single model trained on multiple media types, which enables it to process different media types and relationships between them faster and more accurately than earlier GPT-4 versions that use separate models to process different media types. The context length is 128,000 tokens, equal to GPT-4 Turbo but well below the 2-million limit newly set by Google Gemini 1.5 Pro.

The demos are impressive. In a video, one of the model’s four optional voices — female, playful, and extraordinarily realistic — narrates a story while adopting different tones from robotic to overdramatic, translates fluidly between English and Italian, and interprets facial expressions captured by a smartphone camera.
API access to GPT-4o costs half as much as GPT-4 Turbo: $5 per million input tokens and $15 per million output tokens.
GPT-4o is 2x faster than GPT-4 Turbo on a per-token basis and expected to accelerate to 5x (10 million tokens per minute) in high volumes.
Audio processing is much faster. GPT-4o responds to audio prompts in 0.3 seconds on average, while ChatGPT’s previous voice mode took 2.8 or 5.4 seconds on average relying on a separate speech-to-text step and then GPT-3.5 or GPT-4, respectively.
An improved tokenizer makes text processing more token-efficient depending on the language. Gujarati, for instance, requires 4.4x fewer tokens, Telegu 3.5x fewer, and Tamil 3.3x fewer. English, French, German, Italian, Portuguese, and Spanish require between 1.1x and 1.3x fewer tokens.

GPT-4o significantly outperforms Gemini Pro 1.5 at several benchmarks for understanding text, code, and images including MMLU, HumanEval, MMMU, and DocVQA. It outperformed OpenAI’s own Whisper-large-v3 speech recognition model at speech-to-text conversion and CoVoST 2 language translation.

Aftershocks: As OpenAI launched the new model, troubles resurfaced that had led to November’s rapid-fire ouster and reinstatement of CEO Sam Altman. Co-founder and chief scientist Ilya Sutskever, who co-led a team that focused on mitigating long-term risks, resigned. He did not give a reason for his departure; previously he had argued that Altman didn’t prioritize safety sufficiently. The team’s other co-leader Jan Leike followed, alleging that the company had a weak commitment to safety. The company promptly dissolved the team altogether and redistributed its responsibilities. Potential legal issues also flared when actress Scarlett Johansson, who had declined an invitation to supply her voice for a new OpenAI model, issued a statement saying that one of GPT-4o’s voices sounded “eerily” like her own and demanding to know how the artificial voice was built. OpenAI denied that it had used or tried to imitate Johansson’s voice and withdrew that voice option.

Why it matters: Competition between the major AI companies is putting more powerful models in the hands of developers and users at a dizzying pace. GPT-4o shows the value of end-to-end modeling for multimodal inputs and outputs, leading to significant steps forward in performance, speed, and cost. Faster, cheaper processing of tokens makes the model more responsive and lowers the barrier for powerful agentic workflows, while tighter integration between processing of text, images, and audio makes multimodal applications more practical.

We’re thinking: Between GPT-4o, Google’s Gemini 1.5, and Meta’s newly announced Chameleon, the latest models are media omnivores. We’re excited to see what creative applications developers build as the set of tasks such models can perform continues to expand!

Subscribe to The Batch