Google’s Multimodal Challenger All you need to know about Gemini, Google's new multimodal model

Published

Dec 13, 2023

Reading time

3 min read

Google unveiled Gemini, its bid to catch up to, and perhaps surpass, OpenAI’s GPT-4.

What’s new: Google demonstrated the Gemini family of models that accept any combination of text (including code), images, video, and audio and output text and images. The demonstrations and metrics were impressive — but presented in misleading ways.

How it works: Gemini will come in four versions. (i) Gemini Ultra, which will be widely available next year, purportedly exceeds GPT-4 in key metrics. (ii) Gemini Pro offers performance comparable to GPT-3.5. This model now underpins Google’s Bard chatbot for English-language outside Europe. It will be available for corporate customers who use Google Cloud’s Vertex AI service starting December 13, and Generative AI Studio afterward. (Google did not disclose parameter counts for Pro or Ultra.) Two distilled versions — smaller models trained to mimic the performance of a larger one — are designed to run on Android devices: (iii) Gemini Nano-1, which comprises 1.8 billion parameters, and (iv) Nano-2, at 3.25 parameters. A Gemini Nano model performs tasks like speech recognition, summarization, automatic replies, image editing, and video enhancement in the Google Pixel 8 Pro phone.

Gemini models are based on the transformer architecture and can process inputs of up to 32,000 tokens (equal to GPT-4, but less than GPT-4 Turbo’s 128,000 tokens and Claude 2’s 200,000 tokens). They process text, images, video, and audio natively, so, for instance, they don’t translate audio into text for processing or use a separate model for image generation.
Google did not disclose the contents or provenance of Gemini’s training data, which included web documents, books, and code run tokenized by SentencePiece, as well as image, video, and audio data.
Ultra outperformed GPT-4 and GPT-4V on a number of selected metrics including BIG-bench-Hard, DROP, and MMLU. It also outperformed other models at code generation and math problems.

Misleading metrics: The metrics Google promoted to verify Gemini Ultra’s performance are not entirely straightforward. Google pits Gemini Ultra against GPT-4. However, Gemini Ultra is not yet available, while GPT-4 Turbo already surpasses GPT-4, which outperforms Gemini Pro. Gemini Ultra achieved 90 percent accuracy (human-expert level) on MMLU, which tests knowledge and problem-solving abilities in fields such as physics, medicine, history, and law. Yet this achievement, too, is misleading. Ultra achieved this score via chain-of-thought prompting with 32 examples, while most scores on the MMLU leaderboard are 5-shot learning. By the latter measure, GPT-4 achieves better accuracy.

Manipulated demo: Similarly, a video of Gemini in action initially made a splash, but it was not an authentic portrayal of the model’s capabilities. A Gemini model appeared to respond in real time, using a friendly synthesized voice, to audio/video input of voice and hand motions. Gemini breezily chatted its way through tasks like interpreting a drawing in progress as the artist added each line and explaining a sleight-of-hand trick in which a coin seemed to disappear. However, Google explained in a blog post that the actual interactions did not involve audio or video. In fact, the team had entered words as text and video as individual frames, and the model had responded with text. In addition, the video omitted some prompts.

Why it matters: Gemini joins GPT-4V and GPT-4 Turbo in handling text, image, video, and audio input and, unlike the GPTs, it processes those data types within the same model. The Gemini Nano models look like strong entries in an emerging race to put powerful models on small devices at the edge of the network.

We’re thinking: We celebrate the accomplishments of Google’s scientists and engineers. It’s unfortunate that marketing missteps distracted the community from their work.

Subscribe to The Batch