2 Million Tokens of Context & More Google’s I/O developers’ conference reveals new AI models, features, and upgrades.

Published

May 22, 2024

Reading time

3 min read

Google’s annual I/O developers’ conference brought a plethora of updates and new models.

What’s new: Google announced improvements to its Gemini 1.5 Pro large multimodal model — notably increasing its already huge input context window — as well as new open models, a video generator, and a further step in digital assistants. In addition, Gemini models will power new features in Google Search, Gmail, and Android.

How it works: Google launched a variety of new capabilities.

Gemini 1.5 Pro’s maximum input context window doubled to 2 million tokens of text, audio, and/or video — roughly 1.4 million words, 60,000 lines of code, 2 hours of video, or 22 hours of audio. The 2 million-token context window is available in a “private preview” via Google’s AI Studio and Vertex AI. The 1 million-token context window ($7 per 1 million tokens) is generally available on those services in addition to the previous 128,000 window ($3.50 per 1 million tokens).
Gemini 1.5 Flash is a faster distillation of Gemini 1.5 Pro that features a 1 million token context window. It’s available in preview via Vertex AI. Due to be generally available in June, it will cost $0.35 per million tokens of input for prompts up to 128,000 tokens or $0.70 per million tokens of input for longer prompts.
The Veo video generator can create videos roughly a minute long at 1080p resolution. It can also alter videos, for instance keeping part of the imagery constant and regenerating the rest. A web interface called VideoFX is available via a waitlist. Google plans to roll out Veo to YouTube users.
Google expanded the Gemma family of open models. PaliGemma, which is available now, accepts text and images and generates text. Gemma 2, which will be available in June, is a 27 billion-parameter large language model that aims to match the performance of Llama 3 70B at less than half the size.
Gemini Live is a smartphone app for real-time voice chat. The app can converse about photos or video captured by the phone’s camera — in the video demo shown above, it remembers where the user left her glasses! It’s part of Project Astra, a DeepMind initiative that aims to create real-time, multimodal digital assistants.

Precautionary measures: Amid the flurry of new developments, Google published protocols for evaluating safety risks. The “Frontier Safety Framework” establishes risk thresholds such as a model’s ability to extend its own capabilities, enable a non-expert to develop a potent biothreat, or automate a cyberattack. While models are in development, researchers will evaluate them continually to determine whether they are approaching any of these thresholds. If so, developers will make a plan to mitigate the risk. Google aims to implement the framework by early 2025.

Why it matters: Gemini 1.5 Pro’s expanded context window enables developers to apply generative AI to multimedia files and archives that are beyond the capacity of other models currently available — corporate archives, legal testimony, feature films, shelves of books — and supports prompting strategies such as many-shot learning. Beyond that, the new releases address a variety of developer needs and preferences: Gemini 1.5 Flash offers a lightweight alternative where speed or cost is at a premium, Veo appears to be a worthy competitor for OpenAI’s Sora, and the new open models give developers powerful options.

We’re thinking: Google’s quick iteration on its Gemini models is impressive. Gemini 1.0 was announced less than six months ago. White-hot competition among AI companies is giving developers more choices, faster speeds, and lower prices.

Subscribe to The Batch