Masking Private Data in Training Sets
Google researchers released VaultGemma, an open-weights model trained not to memorize personal information

Chart illustrates exact and approximate memorization percentages in different Gemma models.

Large language models often memorize details in their training data, including private information that may appear only once, like a person’s name, address, or phone number. Researchers built the first open-weights language model that’s guaranteed not to remember such facts.

What’s new: Amer Sinha, Thomas Mesnard, and colleagues at Google released VaultGemma, a 1 billion-parameter model that’s trained from scratch using the technique known as differential privacy. This method prevented the model from memorizing examples that occurred only once in its training data, with a modest sacrifice in performance. Weights are free to download under a license that allows noncommercial and commercial uses with some restrictions.

Differential privacy basics: An algorithm (such as training a neural network) is differentially private if its output (the learned weights) is nearly indistinguishable from the output it would have produced on the same dataset minus any single example. Since the presence or absence of one example can't significantly change the output, personal information can't leak from the weights or from what the weights produce (the model's outputs). In training a neural network, this is achieved by limiting how much any single example's gradient can affect the weights, for instance, by clipping each example's gradient and adding noise so it's hard to distinguish from the gradients of other examples.
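For reference, the standard formal definition (paraphrased here; the announcement doesn't spell it out) says a randomized training algorithm M is (ε, δ)-differentially private if, for any two datasets D and D′ that differ in a single example and any set S of possible outputs:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S] + \delta
```

Smaller ε and δ mean the two cases are harder to tell apart, and therefore the privacy guarantee is stronger.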

Key insight: Most previous work applies differential privacy when fine-tuning a model, but that doesn’t prevent the model from memorizing an example during pretraining. Once private information is encoded in the model’s weights, later fine-tuning can’t remove it reliably. Training with differential privacy from the start ensures that such details don’t become embedded in the model.

How it works: VaultGemma follows the same transformer architecture as the 1 billion-parameter version of Google’s Gemma 2. The authors pretrained it from scratch on the same 13-trillion-token dataset used for Gemma 2 (web, code, and scientific text), applying differential privacy as the model learned to predict the next token in sequences of 1,024 tokens.

  • For every example in a batch, the authors computed the gradient and clipped it so its contribution to the weight update didn’t exceed a fixed threshold. This ensured that any given example didn’t have a disproportionate impact on the weight updates relative to other examples.
  • The authors averaged the clipped gradients across each batch and added Gaussian noise to the average before updating the model’s weights (this update step is sketched in code after the list). The noise weakened the influence of unique examples while allowing repeated examples to stand out. As a result, the model’s weights became statistically indistinguishable from those of a model trained without any particular 1,024-token sequence.
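Here’s a minimal sketch of that per-batch update in NumPy. The function name, hyperparameter values, and the assumption that per-example gradients are already available are illustrative; this is not the authors’ implementation, which also needs a privacy accountant to track the resulting (ε, δ) guarantee.

```python
import numpy as np

def dp_sgd_step(params, per_example_grads, lr=0.1,
                clip_norm=1.0, noise_multiplier=1.0, rng=None):
    """One differentially private SGD update: clip each example's gradient,
    average the clipped gradients, add Gaussian noise, then take a step.
    Hyperparameter values here are illustrative, not VaultGemma's."""
    rng = rng if rng is not None else np.random.default_rng(0)

    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        # Scale each gradient so its L2 norm is at most clip_norm, bounding
        # how much any single example can move the weights.
        clipped.append(g * min(1.0, clip_norm / (norm + 1e-12)))

    avg = np.mean(clipped, axis=0)

    # Gaussian noise calibrated to the clipping bound hides whether any
    # particular example was present in the batch.
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(clipped),
                       size=avg.shape)
    return params - lr * (avg + noise)
```

Called once per batch with the per-example gradients, this reproduces the clip-average-noise pattern described in the list above.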

Results: VaultGemma showed no measurable memorization across 1 million sequences sampled at random from the training set, while pretrained Gemma 1, 2, and 3 models of roughly similar sizes did. VaultGemma’s average performance across 7 question-answering benchmarks generally matched that of GPT-2 (1.5 billion parameters) but fell short of Gemma models of roughly similar size.

  • The authors measured memorization in the Gemma family as the percentage of tested training examples a model could reproduce when given the first half of each sequence (a simplified version of this probe is sketched after this list). Gemma 3 (1 billion parameters) reproduced 0.0005 percent of training examples tested, while Gemma 2 (2 billion parameters) reproduced 0.04 percent. Gemma 1 (2 billion) reproduced about 1 percent. VaultGemma reproduced 0 percent.
  • VaultGemma achieved 39 percent accuracy on HellaSwag, a benchmark that’s designed to test common-sense reasoning in everyday situations. GPT-2 achieved 48 percent and Gemma 3 (1 billion parameters) reached 61 percent.
  • On TriviaQA, which measures factual question answering, VaultGemma achieved 11 percent, while GPT-2 achieved 6 percent and Gemma 3 (1 billion parameters) achieved 40 percent.
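As a rough illustration of the memorization probe described above, the snippet below prompts a model with the first half of a tokenized training sequence and checks whether greedy decoding reproduces the second half verbatim. It uses the Hugging Face transformers API; the checkpoint name and the 50/50 split are assumptions for illustration, not the authors’ exact protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "google/vaultgemma-1b"  # assumed checkpoint id; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

def reproduces_suffix(token_ids):
    """Return True if the model regenerates the second half of a training
    sequence verbatim when prompted with the first half."""
    half = len(token_ids) // 2
    prefix = torch.tensor([token_ids[:half]])
    target = token_ids[half:]

    with torch.no_grad():
        out = model.generate(prefix, max_new_tokens=len(target), do_sample=False)

    # Compare only the newly generated tokens against the true continuation.
    return out[0, prefix.shape[1]:].tolist() == target

# Memorization rate = fraction of sampled training sequences that return True.
```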

Yes, but: The privacy protection comes with a caveat: It applies only to unique examples, such as a private phone number that occurs only once in the dataset. If private information appears repeatedly in the training data (for instance, a celebrity's street address that leaked and appeared in several publications), the model can still learn it as a general pattern.

Why it matters: Private information can find its way into training datasets, and in normal use, a typical large language model may divulge it without the subject’s consent. VaultGemma shows that large open-weights models can be provably private. While such privacy still comes with a cost (VaultGemma 1B performs roughly on par with models built about five years ago), the results are promising, and future work may close that gap.

We’re thinking: The smartest model for sensitive data might be the one that remembers only the most common information.
