Like LoRA, But for Pretraining GaLore, a memory-saving method for pretraining and fine-tuning LLMs

Published

Jul 10, 2024

Reading time

3 min read

Low-rank adaptation (LoRA) reduces memory requirements when fine-tuning large language models, but it isn’t as conducive to pretraining. Researchers devised a method that achieves similar memory savings but works well for both fine-tuning and pretraining.

What’s new: Jiawei Zhao and colleagues at California Institute of Technology, Meta, University of Texas at Austin, and Carnegie Mellon proposed Gradient Low-Rank Projection (GaLore), an optimizer modification that saves memory during training by reducing the sizes of optimizer states. They used this approach to pretrain a 7B parameter transformer using a consumer-grade Nvidia RTX 4090 GPU.

Key insight: LoRA saves memory during training by learning to approximate a change in the weight matrix of each layer in a neural network using the product of two smaller matrices. This approximation results in good performance when fine-tuning (though not quite as good as fine-tuning all weights) but worse performance when pretraining from a random initialization. The authors proved theoretically that updating weights according to an approximate gradient matrix — which reduces the memory required to store optimizer states — can yield the same performance as using the exact gradient matrix (at least for deep neural networks with ReLU activation functions and classification loss functions). Updating weights only once using an approximate gradient matrix is insufficient. However, updating weights repeatedly using gradient approximations that change with each training step (because the inputs change between training steps) achieves an effect similar to training weights in the usual way.

How it works: GaLore approximates a network’s gradient matrix divided into layer-wise matrices. Given a layer’s gradient matrix G (size m x n), GaLore computes a smaller matrix P (size r x m). It uses PG, a smaller approximation of the gradient matrix (size r x n), to update optimizer states. To further save memory, it updates layers one at a time instead of all at once, following LOMO.

At each training step, for each layer, GaLore computed the layer-wise gradient matrix normally.
GaLore computed a smaller matrix P that, when multiplied by the gradient matrix, yielded a smaller matrix that approximated the weight update. GaLore computed P every 200 training steps (that is, it used the same P for 200 training steps at a time before computing a new P).
GaLore multiplied P by the gradient matrix to compute a smaller, approximate version of the gradient matrix. It used this smaller version to update the Adam optimizer’s internal states, requiring less memory to store the optimizer’s internal states. Then the optimizer used its internal states to update the smaller matrix.
GaLore multiplied P by the smaller matrix to produce a full-sized approximation of the gradient matrix. It used the full-sized approximation to update the current layer’s weights.

Results: The authors tested GaLore in both pretraining and fine-tuning scenarios.

The authors compared GaLore to Adam while pretraining five transformer architectures from 60 million to 7 billion parameters to generate the next token in web text. GaLore (set up to represent its internal states using 8-bit numbers) pretrained LLaMA 7B from scratch using 22GB of memory, while Adam (modified to represent its internal states using 8-bit numbers) needed 46GB of memory. After training on 19.7 billion tokens, LLaMA 7B achieved 14.65 perplexity, while Adam achieved 14.61 perplexity (a measure of how well a model reproduces validation examples, lower is better).
They also used GaLore to fine-tune RoBERTaBase on the multi-task benchmark GLUE. GaLore needed 253MB of memory and achieved a score of 85.89 (averaging eight of 11 GLUE tasks), while LoRA needed 257MB of memory and reached 85.61.

Why it matters: LoRA’s ability to fine-tune large models using far less memory makes it a very popular fine-tuning method. GaLore is a theoretically motivated approach to memory-efficient training that’s good for both pretraining and fine-tuning.

We're thinking: LoRA-style approximation has been unlocking data- and memory-efficient approaches in a variety of machine learning situations — an exciting trend as models grow and demand for compute resources intensifies.

Subscribe to The Batch