24 Hours on an Old Consumer GPU: Optimizing LLMs for low-resource hardware

Published Jul 2, 2024

BERT, a large language model released in 2018 and built upon the then-new transformer architecture, marked a paradigm shift in AI. Researchers explored whether innovations since then would enable them to train an equivalent model while using orders of magnitude less processing power.

What’s new: Jonas Geiping and Tom Goldstein at the University of Maryland tried to match BERT using a similar architecture but much less computation. They limited their compute budget to 24 hours on a single, BERT-vintage 11GB Nvidia 2080 Ti GPU, about 1/136th of the compute used to train BERT. Drawing a parallel to studying for a test only one day before taking it, they call their process cramming.

Key insight: According to language model scaling laws, the accuracy of a transformer model depends mainly on the sizes of the model and training set. If tweaking the architecture enables a model to process tokens faster, it can train on more data in the same amount of time — so, after training, it should perform better than a slower model trained for the same amount of time. Therefore the best architecture is the one that, during training, processes the greatest amount of data within a given amount of time.
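To make the trade-off concrete, here is a minimal sketch of the arithmetic: under a fixed time budget, the amount of data a model sees scales directly with its training throughput. The function name and the throughput figures are hypothetical, chosen only for illustration, and are not taken from the paper.

```python
def tokens_seen(tokens_per_second: float, budget_hours: float = 24.0) -> int:
    """Total tokens a model can train on within a fixed wall-clock budget."""
    return int(tokens_per_second * budget_hours * 3600)


# Hypothetical throughputs (assumptions for illustration, not measured values).
baseline = tokens_seen(tokens_per_second=10_000)
faster = tokens_seen(tokens_per_second=12_000)

# A 20 percent gain in throughput buys 20 percent more training data
# in the same 24 hours.
extra = faster - baseline
print(f"baseline: {baseline:,} tokens, faster: {faster:,} tokens, extra: {extra:,}")
```

Under these assumed numbers, the faster variant trains on roughly 173 million more tokens in the same day, which is the kind of gain the authors sought from their architecture changes.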

How it works: The authors built their model using a BERT-size transformer (110 million parameters). They pretrained it on filtered data and fine-tuned it on GLUE, the same benchmark used to evaluate BERT. They modified the architecture, training data, and hyperparameters to improve training speed and efficiency.

Architecture: The authors enabled the architecture to process tokens faster during training while keeping its size nearly the same as BERT’s. The changes included disabling biases in attention and linear layers to compute gradients faster.
Training data: The authors trained the model on parts of The Pile and C4. They filtered according to a handcrafted heuristic that bore on tokenization: They removed documents in which the number of tokens was more than 3/10ths the number of characters. Because the dataset was bigger than the model would process in the time allowed, they fed it text with the most common tokens first, which it was more likely to learn well.
Hyperparameters: They adjusted the learning rate schedule to achieve lower loss toward the end of training. They also removed dropout, a technique to prevent overfitting, as overfitting was unlikely over such a short training duration.
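The two data steps described above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' code: `toy_tokenize` and its vocabulary are hypothetical stand-ins for a real subword tokenizer, and ranking documents by the average corpus frequency of their tokens is only one plausible reading of "most common tokens first."

```python
from collections import Counter

# Hypothetical toy vocabulary; a real pipeline would use a trained subword tokenizer.
VOCAB = {"the", "model", "trains", "on", "clean", "text", "data"}


def toy_tokenize(text):
    """Stand-in tokenizer: known words are one token, unknown words split into characters."""
    tokens = []
    for word in text.lower().split():
        if word in VOCAB:
            tokens.append(word)
        else:
            tokens.extend(word)  # fragments unusual text, as subword tokenizers tend to
    return tokens


def keep_document(text, max_ratio=0.3):
    """Keep a document unless its token count exceeds max_ratio times its character count."""
    return len(toy_tokenize(text)) <= max_ratio * len(text)


def order_by_token_prevalence(docs):
    """Sort documents so those made of the most common tokens come first."""
    counts = Counter(tok for doc in docs for tok in toy_tokenize(doc))
    total = sum(counts.values())

    def avg_prevalence(doc):
        toks = toy_tokenize(doc)
        if not toks:
            return 0.0
        return sum(counts[t] for t in toks) / (len(toks) * total)

    return sorted(docs, key=avg_prevalence, reverse=True)


docs = [
    "the model trains on clean text",
    "xq7 zk9 vv3 pp1",  # tokenizes poorly: 12 tokens for 15 characters
    "the model trains on the data",
]
kept = [d for d in docs if keep_document(d)]
ordered = order_by_token_prevalence(kept)
print(ordered)
```

In this toy run, the second document is dropped because it needs far more tokens per character than the 3/10 threshold allows, and the remaining documents are ordered so the one built from more frequent tokens is seen first.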

Results: The authors’ model didn’t beat BERT, but it came within a few percentage points. For instance, it achieved 78.3 percent accuracy on General Language Understanding Evaluation (GLUE), while BERT achieved 80.9 percent accuracy. Trained using the same limited processing resources, the original BERT architecture achieved 52.0 percent. The authors found that the gains came mostly from architecture changes, followed by data changes, while hyperparameter changes had the least impact.

Why it matters: There’s room to optimize pretraining of LLMs. Careful attention to architecture, training data, and hyperparameters can yield powerful models even with severely limited computation.

We’re thinking: The work serves as a guide to training BERT-style models efficiently and a starting point for training modern transformers.
