Learning After Overfitting

Transformers Continue Learning After Overfitting Data

Published Apr 06, 2022

When a model trains too much, it can overfit, or memorize, the training data, which reduces its ability to analyze similar-but-different inputs. But what if training continues? New work found that overfitting isn’t the end of the line.

What's new: Training relatively small architectures on an algorithmically generated dataset, Alethea Power and colleagues at OpenAI observed that ongoing training leads to an effect they call grokking, in which a transformer’s ability to generalize to novel data emerges well after overfitting.

Key insight: It takes a lot of computation to study how learning progresses over time in models with billions of parameters that train on datasets of millions of examples. It’s equally revealing — and more practical — to study models with hundreds of thousands of parameters that train on thousands of examples. Models on that scale can train through many more steps in far less time.

How it works: The authors trained a set of transformers to classify the solutions to each of 12 two-variable equations, mostly polynomials.

For each equation, they plugged in the possible values for both variables to find all possible solutions. This yielded roughly 10,000 input-output pairs per expression to be divided between training, test, and validation sets.
To feed an equation into a transformer, they represented each equation in a form similar to 2*3=6 but substituted each token with a symbol; say, a for 2, m for *, b for 3, q for =, and so on.
They continued training well beyond the point where training accuracy increased while validation accuracy decreased, a typical indicator for overfitting.
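The data preparation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes modular division with a prime modulus of 97 (the article does not state the modulus; 97 is chosen here because it yields roughly 10,000 pairs), the token ids are arbitrary, and the function names `mod_div_pairs`, `encode`, and `split` are hypothetical. For simplicity it splits the data into training and validation sets only.

```python
import random

P = 97  # assumed prime modulus; not stated in the article


def mod_div_pairs(p=P):
    """Enumerate every (x, y, x / y mod p) triple with y nonzero."""
    pairs = []
    for x in range(p):
        for y in range(1, p):
            # pow(y, -1, p) is the modular inverse of y (Python 3.8+)
            pairs.append((x, y, (x * pow(y, -1, p)) % p))
    return pairs


def encode(x, y, result, p=P):
    """Turn 'x / y = result' into abstract token ids.

    Numbers 0..p-1 use their own value as id; the operator and the
    equals sign get two extra ids (an arbitrary illustrative choice).
    """
    op_id, eq_id = p, p + 1
    return [x, op_id, y, eq_id], result


def split(pairs, train_fraction=0.5, seed=0):
    """Shuffle deterministically and cut into training and validation sets."""
    rng = random.Random(seed)
    shuffled = pairs[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


pairs = mod_div_pairs()
train, val = split(pairs)
print(len(pairs))  # 9312
```

Because the model sees only token ids, it has no access to the numeric meaning of the symbols and must infer the structure of the operation from the examples alone.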

Results: As the models trained, validation accuracy rose, fell, and, after the number of training steps increased by a factor of 1,000, rose a second time. (In the case of modular division, validation accuracy improved from nearly 5 percent to nearly 100 percent.) In experiments using reduced datasets, the authors found that the smaller the training set, the more training was needed to achieve the second increase. For instance, when training on 30 percent as many examples, roughly 45 percent more training steps were required.
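One way to quantify the gap between fitting the training data and generalizing is to compare the first step at which each accuracy curve crosses a threshold. The sketch below is illustrative only: the helper names, the 0.99 threshold, and the accuracy values are made up for the example and are not figures from the paper.

```python
def first_step_reaching(steps, accs, threshold=0.99):
    """Return the first logged step whose accuracy meets the threshold, else None."""
    for step, acc in zip(steps, accs):
        if acc >= threshold:
            return step
    return None


def grokking_delay(steps, train_acc, val_acc, threshold=0.99):
    """Ratio of the step where validation accuracy crosses the threshold
    to the step where training accuracy does. None if either never crosses."""
    fit = first_step_reaching(steps, train_acc, threshold)
    gen = first_step_reaching(steps, val_acc, threshold)
    if fit is None or gen is None:
        return None
    return gen / fit


# Made-up curves, logged at exponentially spaced steps.
steps = [10, 100, 1_000, 10_000, 100_000, 1_000_000]
train_acc = [0.20, 0.80, 1.00, 1.00, 1.00, 1.00]
val_acc = [0.02, 0.10, 0.05, 0.05, 0.60, 1.00]

print(grokking_delay(steps, train_acc, val_acc))  # 1000.0
```

A ratio far above 1 indicates that generalization arrived long after the model had already fit its training set.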

Why it matters: Grokking may be the way that double descent, in which a model’s performance improves, worsens, and improves again as the number of parameters or training examples increases, plays out with small models and datasets. That said, this work provides evidence that we've been mistaken about the meaning of overfitting. Models can continue to learn after they overfit and can go on to become quite capable.

We're thinking: The authors discovered this phenomenon in a petri dish. Now we need to find out whether it holds with life-size models and datasets.
