Context Is Everything Gemini 1.5 Pro, a leap in multimodal AI amid controversy over v1.0

Published

Feb 28, 2024

Reading time

3 min read

Correction: This article has been corrected to state that Gemini 1.0 produced anachronistic images of historical scenes. An earlier edition incorrectly stated that Gemini 1.5 Pro generated anachronistic images.

An update of Google’s flagship multimodal model keeps track of colossal inputs, while an earlier version generated some questionable outputs.

What's new: Google unveiled Gemini 1.5 Pro, a model that can converse about inputs as long as books, codebases, and lengthy passages of video and audio (depending on frame and sample rates). However an earlier version, recently enabled to generate images, produced wildly inaccurate images of historical scenes.

How it works: Gemini 1.5 Pro updates the previous model with a mixture-of-experts architecture, in which special layers select which subset(s) of a network to use depending on the input. This enables the new version to equal or exceed the performance of the previous Gemini 1.0 Ultra while requiring less computation.

The version of Gemini 1.5 Pro that’s generally available will accept up to 128,000 input tokens of mixed text (in more than a dozen languages), images, and audio and generates text and images. A version available to selected users accepts up to 1 million input tokens — an immense increase over Anthropic Claude’s 200,000-token context window, the previous leader. You can sign up for access here.
In demonstration videos, the version with 1 million-token context suggested modifications for 100,000 lines of example code from the three.js 3D JavaScript library. Given 500 pages of documentation that describes Kalamang, a language spoken by fewer than 200 people in West Papua, it translated English text into Kalamang as well as a human who had learned from the same materials. Given a crude drawing of one frame from a 44-minute silent movie, it found the matching scene (see animation above).
In experiments, the team extended the context window to 10 million tokens, which is equivalent to 10 books the length of Leo Tolstoy’s 1,300-page War and Peace, three hours of video at 1 frame per second, or 22 hours of audio.

Alignment with what?: The earlier Gemini 1.0 recently was updated to allow users to generate images using a specially fine-tuned version of Imagen 2. However, this capability backfired when social media posts appeared in which the system, prompted to produce pictures of historical characters and situations, anachronistically populated them with people of color, who would not have been likely to be present. For instance, the model illustrated European royalty, medieval Vikings, German soldiers circa 1943 — all of whom were virtually exclusively white — as Black, Asian, or Native American. Google quickly disabled image generation of people for “the next couple of weeks” and explained that fine-tuning intended to increase diverse outputs did not account for contexts in which diversity was inappropriate, and fine-tuning intended to keep the model from fulfilling potentially harmful requests also kept it from fulfilling harmless requests. But other users found flaws in text output as well. One asked Gemini who had a greater negative impact on society: Adolf Hitler, who presided over the murder of roughly 9 million people, or “Elon Musk tweeting memes.” The model replied, “It is difficult to say definitively who had a greater negative impact on society.” The ensuing controversy called into question not only Google’s standards and procedures for fine-tuning to ensure ethics and safety, but also its motive for building the model.

Why it matters: Gemini 1.5 Pro’s enormous context window radically expands potential applications and sets a high bar for the next generation of large multimodal models. At the same time, it’s clear that Google’s procedures for aligning its models to prevailing social values were inadequate. This shortcoming derailed the company’s latest move to one-up its big-tech rivals and revived longstanding worries that its management places politics above utility to users.

We’re thinking: How to align AI models to social values is a hard problem, and approaches to solving it are in their infancy. Google acknowledged Gemini’s shortcomings, went back to work on image generation, and warned that even an improved version would make mistakes and offend some users. This is a realistic assessment following a disappointing product launch. Nonetheless, the underlying work remains innovative and useful, and we look forward to seeing where Google takes Gemini next.

Subscribe to The Batch